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Preface 


This book is for people interested in feature selection research. Feature se- 
lection is an essential technique for dimensionality reduction and relevance 
detection. In advanced data mining software packages, such as SAS Enter- 
priser Miner, SPSS Modeler, Weka, Spider, Orange, and scikits.learn, feature 
selection procedures are indispensable components for successful data min- 
ing applications. The rapid advance of computer-based high-throughput tech- 
niques provides unparalleled opportunities for humans to expand capabilities 
in production, services, communications, and research. Meanwhile, immense 
quantities of high-dimensional data keep on accumulating, thus challenging 
and stimulating the development of feature selection research in two major 
directions. One trend is to improve and expand the existing techniques to 
meet new challenges, and the other is to develop brand new techniques di- 
rectly targeting the arising challenges. 

In this book, we introduce a novel feature selection technique, spectral fea- 
ture selection, which forms a general platform for studying existing feature 
selection algorithms as well as developing novel algorithms for new problems 
arising from real-world applications. Spectral feature selection is a unified 
framework for supervised, unsupervised and semi-supervised feature selection. 
With its great generalizability, it includes many existing successful feature 
selection algorithms as its special cases, allowing the joint study of these al- 
gorithms to achieve better understanding and gain interesting insights. Based 
on spectral feature selection, families of novel feature selection algorithms can 
also be designed to address new challenges, such as handling feature redun- 
dancy, processing very large-scale data sets, and utilizing various types of 
knowledge to achieve multi-source feature selection. 

With the steady and speedy development of feature selection research, we 
sincerely hope that this book presents a distinctive contribution to feature 
selection research, and inspires new developments in feature selection. We 
have no doubt what feature selection can impact on the processing of massive, 
high-dimensional data with complex structure in the near future. We are truly 
optimistic that in another 10 years when we look back, we will be humbled 
by the accreted power of feature selection, and by its indelible contributions 
to machine learning, data mining, and many real-world applications. 


xi 


xii Preface 


The Audience 


This book is written for students, researchers, instructors, scientists, and 
engineers who use or want to apply feature selection technique in their research 
or real-world applications. It can be used by practitioners in data mining, 
exploratory data analysis, bioinformatics, statistics, and computer sciences, 
and researchers, software engineers, and product managers in the information 
and analytics industries. 

The only background required of the reader is some basic knowledge of 
linear algebra, probability theory, and convex optimization. A reader can ac- 
quire the essential ideas and important concepts with limited knowledge of 
probability and convex optimization. Prior experience with feature selection 
techniques is not required as a reader can find all needed material in the text. 
Any exposure to data mining challenges can help the reader appreciate the 
power and impact of feature selection in real-world applications. 


Additional Resource 


The material in the book is complemented by an online resource at 
http: //dmml.asu.edu/sfs. 
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Chapter 1 


Data of High Dimensionality and 
Challenges 


Data mining is a multidisciplinary methodology for extracting nuggets of 
knowledge from data. It is an iterative process that generates predictive and 
descriptive models for uncovering previously unknown trends and patterns via 
analyzing vast amounts of data from various sources. As a powerful tool, the 
data mining technology has been used in a wide range of profiling practices, 
such as marketing, decision-making support, fraud detection, and scientific 
discovery, etc. In the past 20 years, the dimensionality of the data sets in- 
volved in data mining applications has increased dramatically. Figure 1.1 plots 
the dimensionality of the data sets posted in the UC Irvine Machine Learning 
Repository [53] from 1987 to 2010. We can observe that in the 1980s, the max- 
imal dimensionality of the data is only about 100; in the 1990s, this number 
increases to more than 1500; and in the 2000s, it further increases to about 
3 millon. The trend line in the figure is obtained by fitting an exponential 
function on the data. Since the y-axis is in logarithm, it shows the increasing 
trend of the dimensionality of the data sets is exponential. 

Data sets with very high (>10,000) dimensionality are quite common nowa- 
days in data mining applications. Figure 1.2 shows three types of data that 
are usually of very high dimensionality. With a large text corpus, using the 
bag-of-words representation [49], the extracted text data may contain tens of 
thousands of terms. In genetic analysis, a cDNA-microarray data [88] may 
contain the expression of over 30,000 DNA oligonucleotide probes. And in 
medical image processing, a 3D magnetic resonance imaging (MRI) [23] data 
may contain the gray level of several million pixels. In certain data mining ap- 
plications, involved data sets are usually of high dimensionality, for instance, 
text analysis, image analysis, signal processing, genomics and proteomics anal- 
ysis, and sensor data processing, to name a few. 

The proliferation of high-dimensional data within many domains poses 
unprecedented challenges to data mining [71]. First, with over thousands of 
features, the hypothesis space becomes huge, which allows learning algorithms 
to create complex models and overfit the data [72]. In this situation, the 
performance of learning algorithms likely degenerates. Second, with a large 
number of features in the learning model, it will be very difficult for us to 
understand the model and extract useful knowledge from it. In this case, the 
interpretability of a learning model decreases. Third, with a huge number of 
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FIGURE 1.1: The dimensionality of the data sets in the UC Irvine Machine 
Learning Repository. The x-axis is for time in year and the y-axis is for di- 
mensionality. The y-axis is logarithmic. It shows an exponentially increasing 
trend of data dimensionality over time. 


(a) text data (b) genetic data (c) medical image data 


FIGURE 1.2: Text data, genetic data, and image data are usually of high 
dimensionality. 
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features, the speed of a learning algorithm slows down and their computational 


efficiency declines. Below is an example that shows the impact of the dat 
dimensionality on learning performance. 


a 


Example 1 Impact of data dimensionality on learning performance 


When data dimensionality is high, many of the features can be irrelevant 
or redundant. These features can have negative effect on learning models, 
and decrease the performance of learning models significantly. 

To show this effect, we generate a two-dimensional data set with three 
classes, whose distribution is shown in Figure 1.3. We also generate different 
numbers of irrelevant features and add these features to the data set. We 
then apply a k nearest neighbor classifier (k-nn, k=3) with 10-fold cross- 
validation on the original data set as well as the data sets with irrelevant 
features. The obtained accuracy rates are reported in Figure 1.4(a). We can 
observe that on the original data set, the k-nn classifier is able to achieve 
an accuracy rate of 0.99. When more irrelevant feature are added to the 
original data set, its accuracy decreases. When 500 irrelevant features are 
added, the accuracy of k-nn declines to 0.52. Figure 1.4(b) shows the com- 
putation time used by k-nn when different numbers of irrelevant features 
are added to the original data. We can see when more features present in 
the data, both the accuracy and the efficiency of the k-nn decrease. This 
phenomenon is also known as the curse of dimensionality, which refers 
to the fact that many learning problems become less tractable as feature 
number increases [72]. 


1.1 Dimensionality Reduction Techniques 


In data mining applications with high-dimensional data, dimensionality 


reduction techniques [107] can be applied to reduce the dimensionality of the 
original data and improve learning performance. By removing the irrelevant 
and redundant features in the data, or by effectively combining original fea- 
tures to generate a smaller set of features with more discriminant power, di- 
mensionality reduction techniques bring the immediate effects of speeding up 
data mining algorithms, improving performance, and enhancing model com- 


prehensibility. Different types of dimensionality reduction techniques generally 


fall into two categories: feature selection and feature extraction. 


Figure 1.5 shows the general idea of how feature selection and feature ex- 


traction work. Given a large number of features, many of these features may 


be irrelevant or redundant. Feature selection achieves dimensionality reduc- 
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FIGURE 1.3: A two-dimensional data, set of three different classes. 
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FIGURE 1.4: Accuracy (a) and computation time (b) of k nearest neighbor 
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the data. 
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tion by removing these irrelevant and redundant features. To achieve this, a 
feature evaluation criterion is used with a search strategy to identify the rel- 
evant features. And a selection matrix W is used to filter the original data 
set and generate a reduced data set containing only the relevant features.! 
Unlike feature selection, feature extraction achieves dimensionality reduction 
by combining the original features with a weight matrix W’ to generate a 
smaller set of new features.? In the combination process, the irrelevant and 
redundant features usually receive zero or very small coefficients, therefore 
have less influence on the newly generated features. One key difference be- 
tween feature selection and feature extraction is that the data set generated 
by feature selection contains the original features, while the data set generated 
by feature extraction contains a set of newly generated features. 

Feature selection and feature extraction each have their own merits. Fea- 
ture selection is able to remove irrelevant features and is widely used in data 
mining applications, such as text mining, genetics analysis, and sensor data 
processing. Since feature selection keeps the original features, it is especially 
applicable in applications where the original features are important for model 
interpreting and knowledge extraction. For instance, in genetic analysis for 
cancer study, our purpose is not only to distinguish the cancerous tissues 
from the normal ones, but also to identify the genes that induce canceroge- 
nesis. Identifying these genes helps us acquire a better understanding on the 
biological process of cancerogenesis, and allows us to develop better treatments 
to cure the disease. 

By combining the original features, feature extraction techniques are able 
to generate a set of new features, which is usually more compact and of 
stronger discriminating power. It is preferable in applications such as image 
analysis, signal processing, and information retrieval, where model accuracy 
is more important than model interpretability. 

The two types of dimensionality reduction techniques have different 
strengths and are complementary. In data mining applications, it is often 
beneficial to combine the two types of techniques. For example, in text min- 
ing, we usually apply feature selection as the first step to remove irrelevant 
features, and then use feature extraction techniques, such as Latent Semantic 
Indexing (LSI) [100], to further reduce dimensionality by generating a small 
set of new features via combining original features. 

In this book, we will present a unique feature selection technique called 
spectral feature selection. The technique measures feature relevance by con- 
ducting spectral analysis. Spectral feature selection forms a very general 
framework that unifies existing feature selection algorithms, as well as var- 
ious feature extraction techniques. It provides a platform that allows for the 
joint study of a variety of dimensionality reduction techniques, and helps us 
achieve a better understanding on them. Based on the spectral feature se- 


lThe element of a selection matrix is either 0 or 1. More details about the selection 
matrix will be discussed in Section 1.2.1. 
2The element of a weight matrix can be any real number. 


Data of High Dimensionality and Challenges 7 


[ a relevant feature] | a relevant feature [ a relevant feature] 


k 
select i 
features 


reduced data 


m features 


n instances 


original data selection matrix with original 
, features 
(a) feature selection 
m features k' 
0 
ı Q 
1 O 
IS 
1 & EX 
I + 
1 0 
re 
I A combine 
i features 
i poa reduced data 
original data weight matrix with new 
i features 


(b) feature extraction 


FIGURE 1.5: A comparison of feature selection (a) and feature extraction 
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lection framework, we can also design novel feature selection algorithms to 
address new problems, such as handling large-scale data and incorporating 
multiple types of knowledge in feature selection, which cannot be effectively 
addressed by using existing techniques. Below, we start with a brief introduc- 
tion to the basic concepts of feature selection. 


1.2 Feature Selection for Data Mining 


Feature selection [108, 109] in data mining has been an active research 
area for decades. The technique has been applied in a variety of fields, in- 
cluding genomic analysis [80], text mining [52], image retrieval [60, 180], and 
intrusion detection [102] to name a few. Recently, there have been several 
good surveys published that systematically summarize and compare existing 
works on feature selection to facilitate the research and the application of the 
technique. A comprehensive survey of existing feature selection techniques 
and a general framework for their categorization can be found in [113]. In 
[67], the authors review feature selection algorithms from a statistical learn- 
ing point of view. In [147], the authors provide a good survey for applying 
feature selection techniques in bioinformatics. In [80], the authors review and 
compare the filter with the wrapper model for feature selection. And in [121], 
the authors explore the representative feature selection approaches based on 
sparse regularization, which is a branch of embedded feature selection tech- 
niques. Representative feature selection algorithms are also empirically evalu- 
ated in [114, 106, 177, 98, 120, 179, 125] under different problem settings and 
from different perspectives to provide insight into existing feature selection 
algorithms. 


1.2.1 A General Formulation for Feature Selection 


Assume we have a data set X € ¡R?*”, with m features and n samples (or 
instances, data points). The problem of feature selection can be formulated as 


max r (3) 
Ww 
st. X=XW, We{o,1}"', 
W Imxi = la, || Wiix1 llo = 1. (1.1) 


In the above equation, r (-) is a score function to evaluate the relevance of 
the features in X: the more relevant the features, the greater the value. W is 
the selection matrix, whose element is either 0 or 1. And ||- llo is the vector 
zero norm [59], which counts the number of nonzero elements in the vector. 
The constraints in the formulation ensure that: (1) W! 1.1 = Luxa: each 
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column of W has one and only one “1.” This ensures the original features 
rather than a linear combination of them to be selected; (2) || W1:x1 llo = 1: 
pane the m rows of W, only l rows contain one “1,” and the remaining 

— l rows are zero vectors; (3) X = XW: X contains 1 different columns 
of X. This guarantees that | of the m features are selected, and no feature is 
repeatedly selected. Altogether, the three constraints ensure that X contains 
| different original features of X. The selected l features can be expressed as 
X =XW = (f;,,-.-,fi,), where {i1,...,a} C {1,...,m}, and usually, L << m. 
Clearly, if r (-) does not evaluate features independently, this problem is non- 
deterministic polynomial-time (NP) hard. Therefore, to make the problem 
solvable, we usually assume features are independent or their interaction order 
is low [220]. 


Example 2 Filtering a data set with a selection matriz 


Figure 1.6 shows how a selection matrix can be used to filter a data 
set with the selected features. The data set X contains three features, and 
we want to select the first and the third features (corresponding to the 
first and the third columns of X). To achieve this, we create a matrix W 
that has two columns. The first element of the first column and the third 
element of the second column are set to 1, and all the other elements of W 
are set to 0. X x W results in a data set X containing the first and the 
third columns of X. 


1 7 3 1 0 1 3 
5 6 4|x|0 O|=|5 4 
10 9 8 0 1 10 8 


xX x W = X 


FIGURE 1.6: A selection matrix for filtering data with the selected features. 


1.2.2 Feature Selection in a Learning Process 


Figure 1.7 shows a typical learning process with feature selection in two 
phases: (1) feature selection, and (2) model fitting and performance evaluation. 
The feature selection phase has three steps: (a) generating a candidate set 
containing a subset of the original features via a certain research strategy; 
(b) evaluating the candidate set and estimating the utility of the features in 
the candidate set. Based on the evaluation, some features in the candidate 
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set may be discarded or added to the selected feature set according to their 
relevance; and (c) determining whether the current set of selected features 
are good enough using a certain stopping criterion. If so, the feature selection 
algorithm returns the set of selected features, otherwise it iterates until the 
stopping criterion is met. In the process of generating the candidate set and 
evaluation, a feature selection algorithm may use the information obtained 
from the training data, the current selected features, the target learning model, 
and some given prior knowledge [76] to guide their search and evaluation. 
Once a set of features is selected, it can be used to filter the training and 
the test data for model fitting and prediction. The performance achieved by a 
particular learning model on the test data can also be used as an indicator for 
evaluating the effectiveness of the feature selection algorithm for that learning 
model. 


Feature Selection phase I 


Training Feature Subset Stop 
Data Generation Criterion 


Validation 
Data 


$ : Best Subset 
Test Test Learning Train Learning 


Data Model Model 
Training and Validation Data 


Model Fitting / Performance Evaluation phase II 


FIGURE 1.7: A learning process with feature selection. 


1.2.3 Categories of Feature Selection Algorithms 


Feature selection algorithms can be classified into various categories from 
different perspectives. Below we show five different ways for categorizing fea- 
ture selection algorithms. 


1.2.3.1 Degrees of Supervision 


In the process of feature selection, the training data can be either la- 
beled, unlabeled, or partially labeled, leading to the development of super- 
vised, unsupervised, and semi-supervised feature selection algorithms. In the 
evaluation process, a supervised feature selection algorithm [158, 192] deter- 
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mines feature relevance by evaluating their correlation with the class or their 
utility for creating accurate models. And without labels, an unsupervised fea- 
ture selection algorithm may exploit feature variance or data distribution to 
evaluate the feature relevance [47, 74]. A semi-supervised feature selection al- 
gorithm [221, 197] can use both labeled and unlabeled data. The idea is to 
use a small amount of labeled data as additional information to improve the 
performance of unsupervised feature selection. 


1.2.3.2 Relevance Evaluation Strategies 


Different strategies have been used in feature selection to design feature 
evaluation criteria r (-) in Equation (1.1). These strategies broadly fall into 
three different categories: the filter, the wrapper, and the embedded models. 

To evaluate the utility of features in the evaluation step, feature selection 
algorithms with a filter model [80, 147, 37, 158, 74, 112, 98, 222, 161] rely 
on analyzing the general characteristics of features, for example, the features” 
correlations to the class variable. In this case, features are evaluated without 
involving any learning algorithm. The evaluation criteria r (-) used in the 
algorithms of a filter model usually assume that features are independent. 


Therefore, they evaluate features independently, r (3) =r (£,)+...+r (fi, ). 


Based on this assumption, the problem specified in Equation (1.1) can be 
solved by simply picking the top k features with the largest r (f) value. Some 
feature selection algorithms with a filter model also consider low-order feature 
interactions [70, 40, 212]. In this case, heuristic search strategies, such as 
greedy search, best first search, and genetic-algorithmic search can be used in a 
backward elimination or a forward selection process for obtaining a suboptimal 
solution. 

Feature selection algorithms with a wrapper model [80, 91, 92, 93, 111, 
183, 110] require a predetermined learning algorithm and use its performance 
achieved on the selected features as r (-) to estimate feature relevance. Since 
the predetermined learning algorithm is used as a black box for evaluating 
features, the behavior of the corresponding feature evaluation function r (+) is 
usually highly nonlinear. In this case, to obtain a global optimal solution is 
infeasible for high-dimensional data. To address the problem, heuristic search 
strategies, such as greedy search and genetic-algorithmic search can be used 
for identifying a feature subset. 

Feature selection algorithms with an embedded model, e.g., C4.5 [141], 
LARS [48], 1-norm support vector machine [229], and sparse logistic regres- 
sion [26], also require a predetermined learning algorithm. But unlike an algo- 
rithm with the wrapper model, they incorporate feature selection as a part of 
the training process by attaching a regularization term to the original objec- 
tive function of the learning algorithm. In the training process, the features’ 
relevance is evaluated by analyzing their utility for optimizing the adjusted 
objective function, which forms r (-) for feature evaluation. In recent years, 
the embedded model has gained increasing interest in feature selection re- 
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search due to its superior performance. Currently, most embedded feature 
selection algorithms are designed by applying an Lo norm [192, 79] or an Lı 
norm [115, 229, 227] constraint to an existing learning model, such as the 
support vector machine, the logistic regression, and the principal component 
analysis to achieve a sparse solution. When the constraint is derived from 
the Lı norm, and the original problem is convex, r (-) (the adjusted objective 
function) is also convex and a global optimal solution exists. In this case, var- 
ious existing convex optimization techniques can be applied to obtain a global 
optimal solution efficiently [115]. 

Compared with the wrapper and the embedded models, feature selection 
algorithms with the filter model are independent of any learning model, and 
therefore, are not biased toward a specific learner model. This forms one ad- 
vantage of the filter model. Feature selection algorithms of a filter model are 
usually very fast, and their structures are often simple. Algorithms of a filter 
model are easy to design, and after being implemented, they can be easily 
understood by other researchers. This explains why most existing feature se- 
lection algorithms are of the filter model. On the other hand, researchers 
also recognize that feature selection algorithms of the wrapper and embedded 
models can select features that result in higher learning performance for the 
predetermined learning algorithm. Compared with the wrapper model, feature 
selection algorithms of the embedded model are usually more efficient, since 
they look into the structure of the predetermined learning algorithm and use 
its properties to guide feature evaluation and feature subset searching. 


1.2.3.3 Output Formats 


Feature selection algorithms with filter and embedded models may return 
either a subset of selected features or the weights (measuring the feature rel- 
evance) of all features. According to the type of the output, feature selection 
algorithms can be divided into either feature weighting algorithms or sub- 
set selection algorithms. Feature selection algorithms of the wrapper model 
usually return feature subsets, and therefore are subset selection algorithms. 


1.2.3.4 Number of Data Sources 


To the best of the authors” knowledge, most existing feature selection al- 
gorithms are designed to handle learning tasks with only one data source, 
therefore they are single-source feature selection algorithms. In many real data 
mining applications, for the same set of features and samples, we may have 
multiple data sources. They depict the characters of features and samples 
from multiple perspectives. Multi-source feature selection [223] studies how 
to integrate multiple information sources in feature selection to improve the 
reliability of relevance estimation. Figure 1.8 demonstrates how multi-source 
feature selection works. Recent study shows that the capability of using multi- 
ple data and knowledge sources in feature selection may effectively enrich our 
information and enhance the reliability of relevance estimation [118, 225, 226]. 
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Different information sources about features and samples may have very dif- 
ferent representations. One of the key challenges in multi-source feature selec- 
tion is how to effectively handle the heterogenous representation of multiple 
information sources. 


r a 


Information of Features (1) 


Multi-source = 
Feature Selection z 
Information of Features (p) Relevant Features 
features 


Information Information 


of Samples of Samples 


(1) (q) 


instances 


FIGURE 1.8: Feature selection with multiple data and knowledge sources. 


1.2.3.5 Computation Schemes 


Different computation schemes roughly fall into two categories: serial com- 
putation and parallel computation. Most existing feature selection techniques 
are designed for serial computation in a centralized computing environment. 
An advantage of this computing scheme is its simplicity. However, in recent 
years, the size of data sets in data mining applications has increased rapidly. 
It is common to have a data set of several terabytes (TB, 212 bytes). A data 
set of this size poses scalability challenges to existing feature selection algo- 
rithms. To improve the efficiency and scalability of existing algorithms, paral- 
lel computation techniques, such as such as Message Passing Interface (MPI) 
[163, 63] and Google’s MapReduce [1], can be applied [160]. By utilizing more 
computing (CPU) and storage (RAM) resources, a parallel feature selection 
algorithm is capable of handling very large data sets efficiently. 


1.2.4 Challenges in Feature Selection Research 


Although much work has been done on research of feature selection and a 
large number of algorithms have been developed, as new applications emerge, 
many challenges have arisen, requiring novel theories and methods to address 
high-dimensional and complex data. Below, we consider some of the most 
challenging problems in feature selection research. 
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1.2.4.1 Redundant Features 


A redundant feature refers to a feature that is relevant to the learning 
problem, but its removal from the data has no negative effect.? Redundant 
features unnecessarily increase dimensionality [89], and may worsen learning 
performance. It has been empirically shown that removing redundant features 
can result in significant performance improvement [69]. Some algorithms have 
been developed to handle redundancy in feature selection [69, 40, 56, 210, 6, 
43]. However, there is still not much systematical work that studies how to 
adapt the large number of existing algorithms (especially the algorithms based 
on the filter model) to handle redundant features. 


1.2.4.2 Large-Scale Data 


Advances in computer-based technologies have enabled researchers and 
engineers to collect data at an ever-increasing pace [1, 215, 50]. Data were 
measured in megabytes (MB, 2% bytes) and gigabytes (GB, 2° bytes), then 
terabytes (TB, 212 bytes), and now in petabyte (PB, 215 bytes). A large-scale 
data set may contain a huge number of samples and features. Most exist- 
ing feature selection algorithms are designed for handling data with a size 
under several gigabytes. Their efficiency may significantly deteriorate, if not 
become totally unapplicable, when data size exceeds hundreds of gigabytes. Ef- 
ficient distributed computing frameworks, such as MPI [163, 63] and Google's 
MapReduce [1], have been developed to facilitate applications on cloud infras- 
tructure, enabling people to handle problems of very large scale. Most existing 
feature selection techniques are designed for traditional centralized computing 
environments and cannot readily utilize these advanced distributed computing 
techniques to enhance their efficiency and scalability. 


1.2.4.3 Structured Data 


Not only are data sets getting larger, but new types of data are emerg- 
ing. Examples include data streams from sensor networks [2], sequences in 
proteinic or genetic studies [174], hierarchial data with complex taxonomies 
in text mining [49], and data in social network analysis [152] and system 
biology [5]. Existing feature selection algorithms cannot handle these com- 
plex data types effectively. For instance, in many text mining applications, 
documents are organized under a complex hierarchy. However, most existing 
feature selection algorithms can only handle class labels with a flat struc- 
ture. Also, in the cancer study, feature selection techniques are applied on 
microarray data for identifying genes (features) that are related to carcino- 
genesis. Genetic interaction networks can be used to improve the precision of 
carcinogenic gene detection [224]. For instance, recent studies show that most 
carcinogenic genes are the core of the genetic interaction network [134, 189]. 
However, to the best of the authors” knowledge, most existing algorithms can- 


3Mainly due to the existence of other features which is more relevant. 
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not integrat the information contained in a genetic interaction network (a 
network of feature interaction) in feature selection to improve the reliability 
of relevance estimation. 


1.2.4.4 Data of Small Sample Size 


Opposite to the problem discussed in Section 1.2.4.2, in which sample size 
is tremendous, another extreme is a terribly small sample size. The small 
sample problem is one of the most challenging problem in many feature se- 
lection applications [143]: the dimensionality of data is extremely high, while 
the sample size is very small. For instance, a typical cDNA microarray data 
set [88] used in modern genetic analysis usually contain more than 30000 fea- 
tures (the oligonucleotide probes), yet the sample size is usually less than 100. 
With so few samples, many irrelevant features can easily gain their statistical 
relevance due to sheer randomness [159]. With a data set of this kind, most 
existing feature selection algorithms become unreliable by selecting many ir- 
relevant features. For example, in a cancer study based on cDNA microarray, 
fold differences identified via statistical analysis often offer limited or inaccu- 
rate selection of biological features [118, 159]. In real applications, the number 
of samples usually do not increase considerably, since the process of acquiring 
additional samples is costly. One way to address this problem is to include 
additional information to enhance our understanding of the data at hand. For 
instance, recent developments in bioinformatics have made various knowledge 
sources available, including the KEEG pathway repository [87], the Gene On- 
tology database [25], and the NCI Gene-Cancer database [151]. Recent work 
has also revealed the existence of a class of small noncoding RNA (ribonucleic 
acid) species known as microRNAs, which are surprisingly informative for 
identifying cancerous tissues [118]. The availability of these various informa- 
tion sources presents promising opportunities to advance research in solving 
previously unsolvable problems. However, as we pointed out in Sections 1.2.3.4 
and 1.2.4.3, most feature selection algorithms are designed to handle learn- 
ing tasks with a single data source, and therefore cannot benefit from any 
additional information sources. 


1.3 Spectral Feature Selection 


A good feature should not have random values associated with samples. 
Instead, it should support the target concept embedded in the data. In su- 
pervised learning, the target concept is the class affiliation of the samples. 
In unsupervised learning, the target concept is the cluster affiliation of the 
samples. Therefore, to develop effective algorithms for selecting features, we 
need to find effective ways to measure features’ consistency with the target 
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concept. More specifically, we need effective mechanisms to identify features 
that associate similar values with the samples that are of the same affiliation. 

Sample similarity is widely used in both supervised and unsupervised 
learning to describe the relationships among samples. It forms an effective way 
to depict either sample cluster affiliation or sample class affiliation. Spectral 
feature selection is a newly developed feature selection technique. It evalu- 
ates features’ relevance via measuring their capability of preserving the pre- 
specified sample similarity. More specifically, assuming the similarities among 
every pair of samples are stored in a similarity matrix S, spectral feature 
selection estimates the feature relevance by measuring features’ consistency 
with the spectrum of a matrix derived from S, for instance, the Laplacian 
matrix [33].* 


Example 3 The top eigenvectors of a Laplacian matriz 


Figure 1.9 shows the contour of the second and third eigenvectors of a 
Laplacian matrix derived from a similarity matrix S. The color of the sam- 
ples denotes their class or cluster affiliations. The gray level of the back- 
ground shows how eigenvectors assign values to the samples. The darker 
the color, the smaller the value. 

The figure shows that the second and third eigenvectors assign similar 
values to the samples that are of the same affiliations. So, if a feature is 
consistent with either of the two eigenvectors, it will have a strong capabil- 
ity of supporting the target concept, which defines the affiliation of samples. 


Spectral feature selection is a general feature selection framework. Its ad- 
vantages include: 


e A unified framework: Spectral feature selection forms a general frame- 
work that enables the joint study of supervised, unsupervised, and semi- 
supervised feature selection. With this framework, families of novel fea- 
ture selection algorithms can be designed to handle data with different 
characteristics. 


e A solid theoretical foundation: Spectral feature selection has a solid the- 
oretical foundation, which is supported by spectral graph theory [33], 
numerical linear algebra [38], and convex optimization [131, 18]. Its prop- 
erties and behaviors can be effectively analyzed for us to gain insight for 
improving performance. 


e Great generability: Spectral feature selection includes many existing suc- 
cessful feature selection algorithms as its special cases. This allows us to 


“The concepts of similarity matrix and Laplacian matrix will be introduced in Chapter 
De 
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FIGURE 1.9: (SEE COLOR INSERT) The contour of the second and 
third eigenvectors of a Laplacian matrix derived from a similarity matrix S. 
The numbers on the top are the corresponding eigenvalues. 


study them together to achieve better understanding; on these algorithms 
and gain interesting insights. 


e Handling redundant features: Any algorithm that fits the framework of 
spectral feature selection can be adapted to effectively handle redun- 
dant features. This helps many existing feature selection algorithms to 
overcome their common drawback of handling feature redundancy. 


e Processing large-scale data: Spectral feature selection can be conve- 
niently extended to handle large-scale data by applying mature com- 
mercialized distributed parallel computing techniques. 


e The support of multi-source feature selection: Spectral feature selection 
can integrate multiple data and knowledge sources to effectively improve 
the reliability of feature relevance estimation. 


1.4 Organization of the Book 


The book consists of six chapters. Figure 1.10 depicts the organization of 
the book. 

Chapter 1. We introduce the basic concepts in feature selection, present 
the challenges for feature selection research, and offer the basic idea of spectral 
feature selection. 
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SPECTRAL FEATURE SELECTION 


Concepts Introduction 


Univariate Multivariate 


Implementation Formulations Formulations 


os Connections 
Generalization to Existing 
Algorithms 


Large-Scale Small Sample 
Application Problem Problem 


(Parallel Feature (Multi-Source 
Selection) Feature Selection) 


FIGURE 1.10: The organization of the book. 


Chapters 2 and 3. Features can be evaluated either individually or 
jointly, which leads to univariate and multivariate formulations for spectral 
feature selection, respectively. We present a spectral feature selection frame- 
work based on univariate formulations in Chapter 2. This general framework 
covers supervised, unsupervised, and semi-supervised feature selection. We 
study the properties of the univariate formulations for spectral feature selec- 
tion and illustrate how to derive new algorithms with good performance based 
on these formulations. One problem of the univariate formulation is that fea- 
tures are evaluated independently. Therefore redundant features cannot be 
handled properly. In Chapter 3, we present several multivariate formulations 
for spectral feature selection to handle redundant features in effective and 
efficient ways. 

Chapter 4. Although spectral feature selection is a relatively new tech- 
nique for feature selection, it is closely related to many existing feature se- 
lection and feature extraction algorithms. In Chapter 4, we show that many 
existing successful feature selection and feature extraction algorithms can be 
considered special cases of the proposed spectral feature selection frameworks. 
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The unification allows us to achieve a better understanding of these algorithms 
as well as the spectral feature selection technique. 

Chapters 5 and 6. Spectral feature selection can be applied to address 
difficult feature selection problems. The large-scale data problem and the small 
sample problem are two of the most challenging problems in feature selection 
research. In Chapter 5, we study parallel spectral feature selection and show 
how to handle a large-scale data set via efficient parallel implementations for 
spectral feature selection in a distributed computing environment. In Chap- 
ter 6 we illustrate how to address the small sample problem by incorporating 
multiple knowledge sources in spectral feature selection, which leads to the 
novel concept of multi-source feature selection. 

Although readers are encouraged to read the entire book to obtain a com- 
prehensive understanding of the spectral feature selection technique, readers 
can choose the chapters according to their interests based on Figure 1.10. 
Chapters 1, 2, and 3 introduce the basic concept of feature selection, and 
show how spectral feature selection works. For the readers who are already 
familiar with feature selection and want to learn the theoretical perspectives 
of spectral feature selection in depth, we recommend they read Chapters 2, 3, 
and 4. Chapters 2, 3, 5, and 6 provide implementation details of spectral fea- 
ture selection algorithms, and can be useful for the readers, who want to apply 
spectral feature selection technique to solve their own real-world problems. 

To read the book, a reader may need some knowledge of linear algebra. 
Some basic convex optimization techniques are used in Chapter 3. Some con- 
cepts from biology and bioinformatics are mentioned in Chapter 6. These 
concepts and techniques are all basic and relatively simple to understand. We 
refer readers not familiar with these concepts and technique to the literature 
provided as references in the book. 


Taylor & Francis 
Taylor & Francis Group 


http://taylorandfrancis.com 


Chapter 2 


Univariate Formulations for Spectral 
Feature Selection 


Spectral feature selection tries to select features that are consistent with the 
target concept via conducting [171]. In this chapter, we present several uni- 
variate formulations for spectral feature selection, and analyze the properties 
of the presented formulations based on the perturbation theory developed for 
symmetric linear systems [38]. We also show how to derive novel feature se- 
lection algorithms based on these formulations and study their performance. 
Spectral feature selection is a general framework for both supervised and un- 
supervised feature selection. The key for the technique to achieve this is that 
it uses a uniform way to depict the target concept in both learning contexts, 
which is the sample similarity matrix. Below, we start by showing how a sam- 
ple similarity matrix can be used to depict a target concept. 


2.1 Modeling Target Concept via Similarity Matrix 


Pairwise sample similarity is widely used in both supervised and unsu- 
pervised learning to describe the relationships among samples. It can effec- 
tively depict either the cluster affiliations or the class affiliations of samples. 
For example, assume s;; is the similarity between the i-th and the j-th sam- 
ples. Without class label information, a popular similarity measurement is the 
Gaussian radial basis function (RBF) kernel function [21], defined as 


ee || xi = x; |? 
Sij = EXP og E 


where exp (-) is the exponential function and 6 is the parameter for controlling 
the width of the “bell.” This function ensures samples from the same cluster 
have large similarity and samples from different clusters have small similar- 
ity. On the other hand, when class label information is available, the sample 
similarity can be measured by 


1 azi la ata 
0, otherwise, 


21 


22 Spectral Feature Selection for Data Mining 


where n, denotes the number of samples in class l. This measurement en- 
sures that samples from the same class have a nonnegative similarity, while 
samples from different classes have a zero similarity. Given n samples, the 
n x n matrix S containing the sample similarity of all sample pairs, S(i, j) = 
Sij, i,j =1,...,n, is called a sample similarity matrix. S is also called a ker- 
nel matrix [150], if any of its submatrices is positive semi-definite. A matrix 
A € R”*” is called semi-positive definite [150] (A > 0), if and only if 


x! Ax > 0,Vx € R”. 


Example 4 The consistency of a feature reveals its relevance 


In Figure 2.1, the target concept specifies two categories indicated by the 
two ellipses: C1 and C2. Different shapes correspond to the feature values 
of the samples. As we can see, feature F assigns similar values to the sam- 
ples that are of the same category, while F” does not. Compared to F”, by 
using F to cluster or classify samples, we have a better chance of obtaining 
correct results. Therefore, F is more relevant compared with F”. 


Feature F Feature F' 


FIGURE 2.1: Consistency of two different features. 


Given a sample similarity matrix S, a graph G can be constructed to 
represent it. The target concept is reflected by the structure of G. For example, 
the samples of the same category usually form a cluster structure with dense 
inner connections. As shown in Example 4, a feature is consistent with the 
target concept when it assigns similar values to the samples that are from 
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the same category. Reflecting on the graph G, it assigns similar values to the 
samples that are near to each other on the graph. Consistent features contain 
information about the target concept, and therefore help cluster or classify 
samples correctly. 

Given a graph G, we can derive a Laplacian matrix L (to be discussed in 
the next section). According to spectral graph theory [33, 58, 17, 124], the 
structural information of a graph can be obtained by studying its spectrum. 
For example, it is known that the leading eigenvectors of L have a tendency to 
assign similar values to the samples that are near one another on the graph. 
Below we introduce some basic concepts related to a Laplacian matrix and 
study its properties. Based on this knowledge, we show how to measure fea- 
ture relevance using the spectrum of a Laplacian matrix in spectral feature 
selection. The proposed formulations are applicable for both supervised and 
unsupervised feature selection. 


2.2 The Laplacian Matrix of a Graph 


According to sample distribution (or sample class affiliation), a sample 
similarity matrix S can be computed to represent the relationships among 
samples. Given X, we use G(V, E) to denote an undirected graph constructed 
from S, where V is the vertex set, and E is the edge set. The i-th vertex vi 
of G corresponds to x; € X, and there is an edge between each vertex pair 
(vi, vj). Given G, its adjacency matrix, A € R”*”, is defined as aj; = sij. Let 
d = {dj,do,...,dn}, where d; = y, aik, the degree matrix, D € R”*”, of G 
is defined as P a 

zi iv >=) 
DS { 0, otherwise . 
Obviously D is a diagonal matrix. Here d; can be interpreted as an estimation 
of the density around x;, since the more data points that are close to x;, the 
larger the d;. Given the adjacency matrix A and the degree matrix D, the 
Laplacian matrix L and the normalized Laplacian matrix £ are defined as 


L=D-A; £L=D LD. (2.1) 
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FIGURE 2.2: A graph and its Laplacian matrices. 


Example 5 A graph and its Laplacian matrices 


Figure 2.2 shows a graph and its Laplacian matrices. In the graph, the 
number beside each edge is the length of the edge. To compute the similarity 
between x; and x,, we used the Gaussian radial basis function (RBF) [21], 
with ô= 1 


Sij = EXP (- 2 2 


We can see that D is a diagonal matrix. S, L, and £ are all symmetric 
matrices. The off-diagonal elements of L and £ are all negative. We notice 
that the elements in £ are smaller than those in L. This is due to the fact 
that Lij = 27 lij It is also easy for us to verify that L1 = 0, where 1 is 
the vector with all its elements equal to 1. 


kera) ME 
aad" | = exp : 


With the following theorem, we show some properties of D and L [33]. 
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Theorem 2.2.1 Given the Laplacian matrix L of G, we have 


LVxekrR”, 
1 n 
x! Lx = 3 5 Qi j (xi y rj)’. (2.2) 
i,j=l 
2VxER",VtER, 
(x —t*1)'L(x—t*1)=x'Lx. (2.3) 
3. Vx ER", 
x! Lx > 0. (2.4) 


4. Let EI lea E and 0 = 10,0 0, 
L*1=0x1=0. (2.5) 


Proof To prove Equation (2.2), we notice that by the definition of d;, we 
have 


n n 
x'Lx = x Dx-x' Sx= > dix? — > Aig VL; 
i=1 ij=l 
1 n n n 
= p sana 2 
= = ) diz; — 2 ) Qatti + ) dj; 
i=1 i,j=1 j=1 
n 
1 > 2 
= pen Qi j (x; = zj) é 
24 
1,=1 


Equations (2.3-2.5) can be easily derived from Equation (2.2). 
Bl 


Equation (2.4) shows that the Laplacian matrix L is a positive semi-definite 
matrix. And Equation (2.4) and Equation (2.5) together suggest that the 
smallest eigenvalue of L is 0 and its corresponding eigenvector is 1. (0,1) is 
also called the trivial eigenpair of L. 


Theorem 2.2.2 Given the normalized Laplacian matrix £L of G, we have 


1. Vx e R”, 


(2.6) 


2 
Trx Sra, SE 
E a) 


i,j=1 
2VxER",VteR, 
(x — tx d!) T L(x—txrd1P) =x! Lx. (2.7) 
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3. VxekR”, 
x'Lx>0. (2.8) 
4. Given d, i 7 
Lx*d? = 0 x d2. (2.9) 
5. Vi, 1<i<n, 
O< A; < 2. (2.10) 


Proof The proofs of Equations (2.6)-(2.9) are similar to those in Theo- 
rem 2.2.1. Equation (2.10) can be proven by noticing the fact that Vi, 1 < 
1<n, 


A 


A 
n 
£ 
jo) 
A 
— 
D` 
ES 


i j= 
T 2. oe 
< sup 3 5 2a; j (5 + 2) 
x|=4 j=1 1 J 
= 2\|x||2 = 2. 


In the above derivation, the second inequality holds since (a+b) < 
2 (a? +02). 


Similar to Theorem 2.2.1, Equation (2.8) shows that the normalized Laplacian 
matrix £ is positive semi-definite. And Equations (2.8) and (2.9) together 
suggest that the smallest eigenvalue of £ is 0 and d: is the corresponding 
cigenvector. (0, d2) is called the trivial eigenpair of £. 

The following examples show us some interesting properties of the eigen- 
vectors of the Laplacian matrix. 
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Example 6 The spectrum of a Laplacian matriz 


Figure 2.3 plots the contours of six eigenvectors of a Laplacian matrix L. 
The Laplacian matrix is constructed from a data set with 90 instances. The 
instances are drawn from a mixture of three two-dimensional (2D) Gaussian 
distributions [190, 166], which have a unit variance and a mean at (5,0), 
(0,5), and (5, 5), respectively. The RBF function defined in Equation (2.25) 
is used to compute sample similarity with o being set to 1. Since L € 
R9x% let £ be the eigenvectors of L, so we have £ € R®°°*!. Therefore an 
eigenvector € of L assigns a value to each of the 90 instances sampled from 
the 2D Gaussian mixture [15]. 

For any point in the space without a value, we compute its value by 
averaging the values of the nearby points. Note that only the points that 
correspond to the 90 instances sampled from the 2D Gaussian mixture are 
used for computing the values for other points. In the averaging process, 
the values of their neighbors are weighted by their distance to the points 
with values. 

Let é; denote the eigenvector corresponding to A;, which is the i-th 
smallest eigenvalue of the Laplacian matrix L. Figure 2.3 plots the con- 
tours of the eigenvectors £1, €2, 3, 4, Es, and 29. The contours show how 
the eigenvectors assign values to the sample. Basically, the darker the color, 
the smaller the value. The figure shows that the first eigenvector, €,, assigns 
the same value to all the instances, which is consistent with Theorem 2.2.1. 
Since there are three clusters, the second and the third eigenvectors, €2 and 
€3, capture the cluster structure of the data. The fourth and the fifth eigen- 
vectors, £4 and és, capture the subcluster structure of the data. And the 
20th eigenvector, £20, captures the subtle structures of the data, which may 
be created by noise. We also noticed that A,,A2, and Az are significantly 
smaller than the remaining eigenvalues. The example shows that the clus- 
ter structure of a data set can be extracted from the leading eigenvectors 
of its corresponding Laplacian matrix. 
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A2 = 4.3 x 1078 


FIGURE 2.3: (SEE COLOR INSERT) Contours of the eigenvectors 
£1, €2, €3, €4, 65, and £20 of L. 
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Example 7 The spectrum of a normalized Laplacian matriz 


Figure 2.4 plots the contours of six eigenvectors of a normalized Lapla- 
cian matrix L: €1, €2, €3, €4, €5, and £20. The normal Laplacian matrix is 
constructed from the same data used in Example 6. The figure shows that 
the first eigenvector, é, captures the density information of the data, which 
is consistent with Theorem 2.2.2. Similar to the last example, the second 
and the third eigenvectors, €2 and €3, capture the cluster structure of the 
data. The fourth and the fifth eigenvectors, €4 and £5, capture the subclus- 
ter structure of the data. And the 20th eigenvector, £29, captures the subtle 
structures of the data, which might correspond to noise. Again, we observe 
that A1, A2, and Az are significantly smaller than the remaining eigenvalues. 


According to the spectral graph theory, the eigenvalues of L measure 
the “smoothness” of their corresponding eigenvectors. More specifically, the 
smaller the eigenvalue, the smoother the corresponding eigenvector. Here the 
smoothness measures how often an eigenvector assigns similar values to sam- 
ples that are near each other. This can be explained by Equation (2.2) 


1 n 
Ai = & LEx = XO aig Cea Eng) 


ij=1 


If an eigenvector does not vary much locally (that is, it always assigns similar 
values to samples that are near to each other on the graph), then its corre- 
sponding eigenvalue A; will be a very small value. This justifies the usage of 
the scale of A; to measure the smoothness of its corresponding eigenvector £;. 
Motivated by the above observation, we study below how the graph Laplacian 
matrix can be used for evaluating feature relevance. 


2.3 Evaluating Features on the Graph 


Equation (2.2) shows that given G, the Laplacian matrix of G is a linear 
operator on vectors 


1 
< f,Lf >= f' Lf = 5 So ay(fi- Fy, f = (fi, fas... Ln)" ER”. 
(2.11) 
As discussed in the last section, the equation quantifies how much f varies 
locally or how “smooth” it is over G. More specifically, the smaller the value 
of < f, Lf >, the smoother the vector f is on G. A smooth vector f assigns 
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A2 = 4.6 x 1073 


FIGURE 2.4: (SEE COLOR INSERT) Contours of the eigenvectors 
£1, €2, €3, €4, 65, and £20 of L. 
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similar values to the samples that are close to one another on G, thus it is 
consistent with the graph structure. < f, Lf > can be used to measure the 
consistency of features on a graph. 

However, given a feature vector f; and L, two factors affect the value of 
< f;, Lf; >: the norm of f; and the norm of L. The two factors need to be 
removed, as they do not contain structure information of the data, but can 
cause the value of < f;,Lf; > to increase or decrease arbitrarily. The two 


factors can be removed via normalization. Based on the relationship between 
L and £, we have 


< fi, Lf; >= E Lf, = f£ D?LD?f, = (D?f,) L(D?f,). 


Let f, = (D2f,) denote the weighted feature vector of F;, and f = E | 


normalized weighted feature vector. The score of F; can be evaluated by the 
following function: 


DE) E A (2.12) 


With the following theorem, we show the relationship between p1(F,) and 
the normalized cut of a graph. Normalized cut is a concept from spectral 
graph theory. It measures the capability of a cluster indicator for partitioning 
the data into well-separated clusters. The following theory suggests that if a 
feature is relevant, by using the feature as a “soft” cluster indicator, we can 
obtain a “clear cut,” which partitions the data into well-separated clusters. 


Theorem 2.3.1 pi(F;) measures the value of the normalized cut [156] by 
using f; as the soft cluster indicator to partition the graph G. 


Proof: The theorem holds as 


Given a graph, assume we partition the graph into two clusters Cı and 
Co. The normalized cut corresponding to Cı and C2 can be calculated by 


cut (C1, Ca) + cut (Ca, Cı) 


t — 
Ut Gh) vol (C2) > 


(2.13) 


where cut(C1, 02) measures the total weight of the edges connecting two clus- 
ters, and vol(C1) measures the total weight of the edges having at least one 
endpoint in cluster C1. They are defined as 


cut (C1, C2) = 5 Qij, vol (C1) = 5 Qij. (2.14) 


1€C1,j€Ca 1€C1Vj 


Equation (2.13) shows that the normalized cut prefers a partition with the 
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following two properties. First, the normalized cut prefers a partition, in which 
the edges between different clusters have low weights, and the edges within 
each cluster have high weights. Such a partition ensures that the instances 
in different clusters are different, and the instances within the same cluster 
are similar. Second, the normalized cut requires the partition to be balanced, 
since vol(C + C2) is a constant, and 1/vol (C1) + 1/vol (C2) is the minimum 


when vol (C1) = vol (C2). Let c = (c1,...,¢n) be a cluster indicator for C4, 
i di A i 

such that if i € Cy, c; = 1, otherwise c; = i where d; is the i-th 
1€C9 “2 


diagonal element of the degree matrix D. In [157], it is shown that that 


c'Le 
cIDe 


= cuty (C1, C2). (2.15) 


Equation (2.15) computes the normalized cut value corresponding to the 


cluster indicator c. The values of the elements in c are either 1 or > a 
1609 di 


therefore it is called a discrete cluster indicator. If we relax this requirement 
and allow c; € R, then the corresponding c becomes a soft cluster indicator. 
A good soft cluster indicator leads to a small normalized cut value by assign- 
ing similar values to samples that are near one another on the graph. Both 
theoretical and empirical results show that the normalization step makes the 
normalized cut more robust to outliers [157, 39, 128]. 


cut 1 cut 2 


FIGURE 2.5: Two possible cuts of a graph. 
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Example 8 The normalized cut of a graph 


In this example, we illustrate the effect of the normalized cut. We compare 
the normalized cut to the standard cut. Given two clusters Cy and C2, in 
the normalized cut, we define the cluster indicator as 


1, ie Ci 
p E d; 
Ci Dicc , i € Co 
Dicc, di 


and 
_ cut (C1, C2) „cut (C2,C1) _ c Le 
SUNS a) vol(Ci) " vol(Cz) cTDe' 


In the standard cut, the cluster indicator is specified as 


and i 
cut (C1, C2) = je Le. 


Figure 2.5 shows a graph containing three clusters. The first two clusters 
both have n instances. In each of the two clusters there is a center point, 
and all other instances in the cluster connect to the center point with a 
distance of 1. The center points of the two clusters are connected and the 
distance is 2. The third cluster has only one instance, and the instance 
is connected to the center point of the second cluster with a distance of 
3. The figure shows two possible cuts of the graph. The first one cut the 
edge connecting the first and the second cluster, and the second one cut 
the edge connecting the second and the third cluster. Figure 2.6 shows how 
cut (C1, C2) and cut y (C1, C2) vary with the size of the first and the second 
clusters.! When there is only one instance in the first and the second clus- 
ter, both cut (-) and cutu (-) assign smaller cut values to cut 2, since cut 2 
cuts an edge with a longer distance. In this case both cut (-) and cutu (-) 
prefer cut 2. However, when the number of instances, n, increases, cutu (-) 
begins to assign a smaller cut value to cut 1, since cut 1 cuts the graph in 
a more balanced way. Compared with cut y (-), the value of cut (-) does not 
change with n, since it does not consider the size of the clusters. 


Given the normalized Laplacian matrix £, we can calculate its eigen- 
decomposition (A;, €;), where A; is the eigenvalue and €; is the eigenvector 


1To compute the similarity matrix for the graph, we use a RBF kernel function with 
6 =2. 
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Normalized Cut vs. Cut 


1 2 3 4 5 6 7 8 9 10 


—©-cut1 --o- cut 2 ncut 1 --&- ncut 2 


FIGURE 2.6: (SEE COLOR INSERT) The cut value (y-axis) of different 
types of cut under different cluster sizes (x-axis). The x-axis corresponds to 
the value of n in Figure 2.5. 


(1 <i < n). Assuming A < A2 < ... < An, according to Theorem 2.2.1, 
Le 

we have: A = 0 and & = sar which form the trivial eigenpair of the 
21 

graph. Also we know that all the eigenvalues of £ are in the range of [0, 2]. 

Given a spectral decomposition of £L, we can rewrite Equation (2.12) using 


the eigensystem of £ to achieve a better understanding of the equation. 


Theorem 2.3.2 Let (A;,£;), 1< j < n be the eigensystem of L, and a; = 
cos 0; where 6, is the angle between f; and £j. Equation (2.12) can be rewritten 
as 


gil Fi) = SN a5 Xj, where Na; =1. (2.16) 
j=1 j=1 
Proof: Let © = DIAG(A1,A2,...,An) and U = (€1,€2, -..,En). As [Él] = 
|£;]| = 1, we have LE = cos6;. We can rewrite fı £f, as 
Ê cf, = Ê USUTÊ = (Oia a = X ai. 
i=l 


aT A x 
Since 577,05 = fi UU"£,, UU" = I and ||f,|| = 1, we have >>5_, 05 = e 
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Theorem 2.3.2 suggests that by using Equation (2.12) the score of F; is 
calculated by combining the eigenvalues of £, and cos6),...,cos@, are the 
combination coefficients. Note, here cos 0, measures the similarity between 
the the feature vector and the i-th eigenvector of £. Since A, = 0, Equa- 


tion (2.16) can be rewritten as e f, = => Aj, meaning that the value 
obtained from Equation (2.12) evaluates the smoothness of f; by measuring 
the similarities between f; and those nontrivial eigenvectors of £. Assuming 
that f aligns closely to the top eigenvectors of £, clearly D a? Aj will be 
small. As shown in Figure 2.4, the top eigenvectors of £ assign similar values 
to the instances from the same cluster. Therefore if a feature aligns closely 
with these eigenvectors, it will be smooth on the graph. 

Since »7;-, 0} = 1 and a; > 0, we have ) 5,05 < 1. The bigger the 
ai, the smaller the X37- a7 is. The value of pi(F;) can be small if f; is very 
similar to €;. However, in this case, a small pi(F;) value does not indicate 
better separability, since the trivial eigenvector €; does not carry any distri- 
bution information except the density around samples. To handle this issue, 
we propose to use )_j_» a4 to normalize pi(F;), which gives us the following 
ranking function: 


eo Fi) = = (2.17) 


Sor A 

mas î cf 
n AT 9: 
2% 1 — (f E) 


A small y2(F;) indicates that f, aligns closely to those nontrivial eigenvectors 
with small eigenvalues, hence it is smooth on the graph. 

According to spectral clustering theory, the leading k eigenvectors of £ 
form the optimal soft cluster indicators that separate G into k parts. The 
remaining eigenvectors correspond to the subtle structures formed by noise. 
Therefore, if k is known, for instance, we know that the data set contains 
samples from k different categories, which should form k dense clusters. We 
can also estimate feature relevance by the following function: 


k 
y3(Fi) =X 2- A)o?. (2.18) 


j=2 

By its definition, p3 assigns bigger scores to features that are more relevant. 
This is because if a feature achieves a large score with p3, it must align closely 
to the nontrivial eigenvectors £2,...,€x, with £2 having the highest priority. By 
focusing on the leading eigenvectors, y3 can effectively reduce noise. Similar 
mechanisms are also used in principal component analysis (PCA) [15] and 
spectral dimension reduction [145, 119, 75, 13, 148, 198] for eliminating noise. 
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Example 9 Evaluating features with pi(-), pal-), pal-) 


Figure 2.7 plots the contours of six features: F, Fo, F3, Fy, Fs, and Fg. 
Among the six features, Fı and F are relevant features and correspond to 
the first and second dimensions of the 2D Gaussian mixture generated in 
Example 6. F3,..., Fe are randomly generated, with their values following 
the standard uniform distribution, and thus are irrelevant features. Since 
L € R90x9%, F; e RX! i= 1,...,6, any of the six features will assign a 
value to each of the 90 instances sampled from the 2D Gaussian mixture. 
To generate a contour for a feature F;, for any point in the space without 
a value, we compute its value by averaging over the values of its nearby 
points, which are assigned values by F;. In the averaging process, the values 
from its neighbors are weighted by their distance to it. The normalized 
Laplacian matrix £ constructed in Example 7 is used in computing «1 (-), 
p2(-), and pa(-) for feature evaluation. 

From Figure 2.7, we can observe that the two relevant features, F1 and 
F>, are smoother than the other four irrelevant features on the graph. They 
can all be identified as relevant by the three feature ranking functions. The 
results show that they achieve small values with (1 (-) and pa(-), and large 
values with (p3(-). 


2.4 An Extension for Feature Ranking Functions 


A Laplacian matrix is also used in supervised 
larization functions to penalize predictors that vary abruptly among adjacent 
vertices on a graph. In [162], the authors relate t 
Fourier basis [170] and extend the usage of £ to y 


matrix function [59] defined as 


y(L) 


learning for designing regu- 


he eigenvectors of £ to the 
L), where y(-) is a spectral 


Dyan ONE 


(2.19) 


In the formulation, y(A;) is an increasing function that adjusts the eigenvalues 
of the normalized Laplacian matrix £. It is pointed out in [217] that due to 
the existence of noise, the difference between the small eigenvalues and the 
large eigenvalues shrinks. By using a high-order spectral matrix function, we 
can effectively enlarge this difference. 

Given a normalized Laplacian matrix £, its eigenvalues measure the 
smoothness of the corresponding eigenvectors. The smaller the eigenvalue, 
the smoother the eigenvector. By applying an increasing function y(-), we 
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pi(F1) = 0.012 pi(£2) = 0.009 
pa(F,) = 0.031, pa(F,) = 0.377 pa(Fa) = 0.027 pa(Fa) = 0.537 


x X 
yi(F3) = 0.239 (1 (Fr) = 0.346 
p2(F3) = 1.030, y3(F3) = 0.015 yo(F4) = 1.111, y3(F4) = 0.000 


p1(F5) = 0.204 pi (Fs) = 0.266 
pa(Fs) = 0.918, pa(F5) = 0.015 yo(Fe) = 1.059, pa(F5) = 0.003 


FIGURE 2.7: (SEE COLOR INSERT) Contours and the scores of six 
features. Among these features, F1 and F are relevant, and F3, F4, F5, and 
Fé are irrelevant. 
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reduce the value of the estimated smoothness of the eigenvectors with large 
eigenvalues, which usually contain structure information generated by noise. 
It is shown in [217] that using the y(£) to design the regularization functions 
can effectively improvement the robustness of an algorithm in a noisy learning 
environment. 


Example 10 Eigenvalues of a noise-contaminated Laplacian matriz 


Figures 2.8 and 2.9 plot the distributions of the eigenvalues of £, £, and 


£?. Here £ is the normalized Laplacian matrix constructed in Example 7. 


£ is the noise-contaminated Laplacian matrix, with lêz S = 0.5. From the 


figure, we can observe that the noise has two different effects on the distri- 
bution of the eigenvalues. First, the slope of the distribution becomes flat. 
And second, the gaps between the leading eigenvalues become smaller.? 
Comparing 3 with £, the leading eigenvalues of £3 are much smaller than 
those of £, yet on the other hand, the tailing eigenvalues become much big- 
ger. The tail eigenvectors usually contain information generated by noise. 
In the evaluation process, by using £3, we penalize the tail eigenvectors by 
reducing the value of their estimated smoothness. 


The above example motivates us to extend our feature ranking functions, 
pil), pa(:), and p3(-) to the following forms for improving their robustness 
in a noisy learning environment: 


Aili) = £ (L) 0 = Dan) (2.20) 
RNA) gt ng 

pl) = Á == E 5, (2.21) 
2% 1 — (f &) 


k 
O3x(Fi) = VO). (2.22) 


Calculating the spectral decomposition (or eigen decomposition) of £ can 
be expensive for data with a large number of samples. However, since y(-) is 
usually a polynomial function y(£) can be calculated efficiently by regarding 
L as a variable and applying y(-) on it. For example, assume (A) = AS, 
then y(L) = L’. For ¢3(-), the k leading eigenpairs of £ can be obtained 
efficiently by using fast eigensolvers such as the implicitly restarted Arnoldi 
method [142]. 


2Except the gap between A and Az. 
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MO) 


0.57 


FIGURE 2.8: The distribution of the eigenvalues of £. £ is the Laplacian 
matrix obtained from the original data. 


\(L) (£3) 


0.5 


FIGURE 2.9: The distribution of the eigenvalues of £ and e L is the 
Laplacian matrix obtained from the noise-contaminated data. aa = 0.5. 
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2.5 Spectral Feature Selection via Ranking 


The feature evaluation criteria we just presented can be used in a unified 
framework for spectral feature selection. In this framework, the relevance of 
a feature is determined by its consistency with the structure of the graph 
induced from the given, sample similarity. The three feature evaluation func- 
tions, P1(-), Pa(-), and P3(-), lay the foundation of the framework and enable 
us to derive families of supervised and unsupervised feature selection in a 
unified manner. The pseudo-code of the unified framework is shown in Algo- 
rithm 1. It selects features in three steps: (1) building the similarity matrix S 
and constructing its graphical representation (Lines 1-3); (2) evaluating fea- 
tures using the eigensystem of the graph (Lines 4-6); and (3) ranking features 
in descending order according to feature relevance, and feature selection is ac- 
complished by choosing the desired number of features from the returned list 
(Lines 7-8). We name the framework SPEC [222], stemming from the fact that 
the framework is based on analyzing the spectrums of the normalized Lapla- 
cian matrix £. Algorithm 1 shows the framework for evaluating features one 
by one with a feature evaluation criterion which is independent of any learning 
algorithm. Therefore, the framework is a filter model for feature selection. 


Algorithm 1: SPEC 

Input: X,y, e), L, PE (01, o, 63) 

Output: a list of features 

construct S, the similarity matrix from X (or y); 

construct graph G from S; 

build W, D and L from G; 

for each feature vector f; do 

s | fe DIE; SESpro(i) + OF) 
IDE e 

6 ranking features in descending order according to feature relevance; 

7 return top / features from the ranked list; 


RON 


SPEC is a general framework for feature selection. It can be used to sys- 
tematically derive novel algorithms for different learning contexts. Its three 
key components are: 


1. Sample similarity matrix S 
For example, the matrix constructed using an RBF function introduced 


in Equation (2.25). 


2. Spectral matrix function y(-) 


For example, y(r) =r, y(r) = r?. 
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3. Feature rank function £(-). 


For example, one of the functions in {1(-), Pal), G3(-)}. 


By using different combinations of these components, we can generate fam- 
ilies of new algorithms for spectral feature selection. For instance, depending 
on how label information is used for constructing the similarity matrix S, we 
can use SPEC to generate algorithms for unsupervised, supervised, or semi- 
supervised feature selection. Below we show how to apply SPEC in different 
learning contexts in detail. 


2.5.1 SPEC for Unsupervised Learning 


When only unlabeled data are provided, the similarity matrix S can only 
be constructed in an unsupervised way, and the feature selection algorithms 
generated by SPEC in this case are unsupervised. Below we list some of the 
popular functions for computing the similarity among instances in an unsu- 
pervised learning context. 


1. Linear kernel function: 
Sij =X, Xj +c. (2.23) 


The linear kernel function is the simplest kernel function. It is given by 
the inner product of two instances, plus an optional constant c. 


2. Polynomial kernel function: 


d 
Sij = (ax; x; +c). (2.24) 
The polynomial kernel function is a non-stationary kernel function. In 
the equation, a is the slope parameter, c is the constant term, and d is 
the polynomial degree. 


3. Radial basis function (RBF): 


Xi — Xj 2 
Sij = exp |) : (2.25) 


The adjustable parameter 6 plays a very important role in the perfor- 
mance of the function, and should be carefully tuned according to the 
problem at hand. If overestimated, the exponential will behave almost 
linearly, and the higher-dimensional projection will start to lose its non- 
linear power. On the other hand, if underestimated, the function will 
lack regularization and the decision boundary will be highly sensitive to 
noise in training data. 


4. Cosine similarity function: 


x! x; 
Sij = a > 
Ixl - lx! 
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Cosine similarity is used to measure the similarity between two vectors 
by computing the cosine of the angle between them. It is often used for 
high-dimensional applications such as document clustering and catego- 
rization in text mining. 


2.5.2 SPEC for Supervised Learning 


When class label information is available, the sample similarity matrix 
can be directly formed from label information. For instance, the following 
two functions are usually used for constructing a similarity matrix S in a 
supervised way: 


e ai l, y=y =l 

Lia { 0, otherwise ’ (2.26) 
1 

ic Di { 0, otherwise ` (Zar) 


In Equation (2.27), nu is the number of samples in the 1 class. Compar- 
ing with Equation (2.26), Equation (2.27) normalizes the similarity of the 
samples from the same class by the size of the class. The mechanism is help- 
ful for balancing classes of different sizes. By plugging in a similarity matrix 
constructed by using the label information, one can obtain a spectral feature 
selection algorithm for supervised learning. 


2.5.3 SPEC for Semi-Supervised Learning 


In many real applications, such as text mining and image processing, data 
are abundant, but labeled data are costly to obtain. It is common to have a 
high-dimensional data set with a large number of unlabeled samples but only 
a few labeled samples. The data sets of this kind present a serious challenge to 
supervised feature selection: the so-called “small labeled-sample problem” [82]. 
That is, when the labeled sample size is too small to provide sufficient infor- 
mation about the target concept, supervised feature selection algorithms may 
fail by either unintentionally removing many relevant features or selecting ir- 
relevant features, which seems to be significant only on the small labeled data. 
Unsupervised feature selection algorithms can be an alternative in this case, 
as they are able to use the large amount of unlabeled data. However, as these 
algorithms ignore label information, important hints from labeled data are left 
out and this will generally downgrade the performance of unsupervised feature 
selection algorithms. Under the assumption that labeled and unlabeled data 
are sampled from the same population generated by the target concept, using 
both labeled and unlabeled data is expected to better estimate feature rele- 
vance. Semi-supervised learning is to learn from mixed labeled and unlabeled 
data [30]. 
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The proposed spectral feature selection framework, SPEC, can be naturally 
extended to achieve semi-supervised feature selection through a regularization 
framework, in which a feature's relevance is evaluated by its consistency with 
both labeled and unlabeled data. The idea can be formulated as 


Apu (Fi) + (1 — A) Gs (Fi) - (2.28) 


The first term and the second terms of Equation (2.28) estimate the con- 
sistency of feature F; with the labeled and the unlabeled data, respectively. 
To this end, one can construct two similarity matrices, S, and S,,, from the 
labeled and the unlabeled data, respectively. Hs (-) and ĝu (-) can be obtained 
by applying S, and S, in SPEC. Note, fs (-) and Qu (-) are feature ranking 
functions, and are chosen from (61 (-), $2 (-), s (-)}. They are marked with 
different subscripts, since they are applied to different similarity matrices, Ss 
and S,. 

For a feature to be identified as relevant, it must be consistent with the 
distribution of both the large amount of unlabeled data and the small amount 
of labeled data. This idea is illustrated in Figure 2.10. The samples with 
“o” shape are from the negative class, the samples with “+” shape are from 
positive class, and the remaining ones are unlabeled samples. The ellipses in 
the figure denote clusters. We have two features, F and F”, and they assign 
the same value to samples that are from the same cluster. Note that the 
two features are consistent with different cluster structures. As we can see, 
since the cluster structure of the unlabeled data is ambiguous, feature F and 
feature F” are equally smooth on the unlabeled data. However, feature F is 
more consistent with the labeled data, which suggests that F is more relevant. 


Feature vector f Feature vector f’ 


FIGURE 2.10: Use label information in semi-supervised feature selection. 


44 Spectral Feature Selection for Data Mining 


Note, if we replace s in Equation (2.28) with the normalized mutual 
information [212] between a feature F; and the class y, we will obtain the 
feature evaluation criterion used in sSelect [221], which is one of the first 
semi-supervised feature selection algorithms in the literature. 

One issue related to the framework formulated in Equation (2.28) is that 
the regularization parameter A is data dependent. That is, for different data, 
the best value for the regularization parameter may vary quite a bit. Therefore, 
to find a proper value for the regularization parameter A is crucial to ensure 
the performance of the framework. In [219], a parameter tuning mechanism 
is developed based on studying the cut-value? achieved on the data that only 
contain the selected features. We find the mechanism is often effective, but 
time-consuming. Therefore developing an efficient and effective technique for 
determining the value of the regularization parameter remains an open issue. 


2.5.4 Time Complexity of SPEC 


The time complexity of SPEC largely depends on the cost of (1) building 
the similarity matrix and the calculation of y(-); (2) feature evaluation with 
SPEC; and (3) feature ranking. 

First, we analyze the time complexity of constructing the similarity matrix 
and calculating the y(-) in various learning contexts. In the unsupervised case, 
if we use the RBF function to build the similarity matrix, and y(-) is in the 
form of £L”, the time complexity of this step is ((rn + m)n2). This is because 
we need O(mn?) operations to build S, D, L, and £. We then need O(rn3) 
operations to calculate the y(£). In the supervised case, if Equation (2.27) is 
used to construct the similarity matrix, we need O(n”) operations to compute 
S and L. It can be verified that in this case, D is an identity matrix, D = I. 
L has c (the number of classes) 0 eigenvalues, and all other eigenvalues are 
1. Therefore, we do not need to compute £ and y(£). So in the supervised 
case, the time complexity for this step is O(n). In the semi-supervised case, 
the similarity matrix is obtained by combining S, and S,,, which are the 
similarity matrices constructed in the supervised and the unsupervised ways, 
respectively. Therefore, the time complexity of this step is ((rn + m)n2). 

Upon obtaining y(L), we need O(n?) operations to calculate SFs pgo (i) for 
each feature: transforming f; to f, requires O(n) operations; and calculating Și, 
2, and (p3 needs O(n”) operations.* Therefore, we need O(mn?) operations 
to calculate scores for m features. 

Lastly, we need O(m log m) operations to rank the features. 

In summary, the overall time complexity of SPEC is O ((rn + m)n2) for 
unsupervised or semi-supervised feature selection, and O (mn?) for supervised 


3 Given data X, the cut-value can be calculated as follows: first, constructing the adjacent 
matrix A from X; then, forming the normalized Laplacian matrix £ using A; last, obtaining 
the cut-value by calculating the second smallest eigenvalue of £. 

4For $3, using the Arnoldi method to calculate a few eigenpairs of a large sparse matrix 
needs roughly O(n?) operations, and calculating ¢3 itself needs O(k) operations. 
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feature selection. If y(-) is not used for noise reduction, the time complexity of 
SPEC is O (mn?) for all three learning contexts. Note that in the supervised 
case, the similarity matrix S is usually of special structure, which can be 
utilized for improving efficiency. For instance, if Equation (2.27) is used to 
construct the similarity matrix, the time complexity of SPEC can be improved 
to O(emn). In this case, S has only c nonzero eigenvalues,? and L = I — S. 

Below we analyze the effect of components of SPEC, and provide a guide- 
line for users to choose the proper components according to the reality of 
different applications. 


2.6 Robustness Analysis for SPEC 


Being robust is important for feature selection algorithms [146, 211]. A 
feature selection algorithm is not robust if a small perturbation of the origi- 
nal data can cause a great change in its output. Although the perturbations 
can be various (e.g., caused by noise), the underlying target concept remains 
unchanged. Therefore a good feature selection algorithm should be robust to 
the potential perturbations. Below, we provide a robustness analysis for fea- 
ture ranking functions 1(-), pa(-), and y3(-). The analysis is based on the 
perturbation theory developed for symmetric linear systems [38], and can be 
extended to P1(-), G2(-), and ĝ3(-) easily. We first present two theorems, which 
serve as the basis for the following analysis. 


Theorem 2.6.1 (Weyl) Let A and E be n-by-n symmetric matrices. Let 


A <... S An be the eigenvalues of A and Mack Sines Ns be the eigenvalues 
of A=A+E, JA; — AI < EI. 


Assume A is the perturbed A, and E corresponds to noise. Theorem 2.6.1 
shows that the eigenvalues of the perturbed matrix A are bound by their 
corresponding eigenvalues of A and the scale of the perturbation matrix E, 
which is measured by its norm. 


Theorem 2.6.2 Let A and E be n-by-n symmetric matrices. Let A = 
QAQ' = Qdiag(A¡)Q' be an eigen decomposition of A. Let A +E = A= 


QAQ™ be the perturbed eigen decomposition. Write Q = [€1,... En] and 
Q = [£1,...,£n], where & and €; are the unperturbed and perturbed unit eigen- 


vectors, respectively. Let 0; denote the acute angle between €; and Es. Provided 
that GAP(j, A+ E) > 0, we have the following inequality 


IEI» 
GAP (j, A+ E)’ 


Note that when 0; < 1, then 5 sin 20; ~ sind; =0;. 


1 
5 sin 26; < (2.29) 


5 All these eigenvalues equal 1. 
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Proofs of the two theorems can be found in Section 5.2 of [38]. In The- 
orem 2.6.2, GAP(j, A + E) denotes the eigengap [129, 128] of Aj, where A; 
is the j-th eigenvalue of A + E. Formally we can define GAP(j, A +E) as 
GAP(j, A + E)=min;z,; |À; — Aj]. According to Theorems 2.6.1 and 2.6.2, the 
robustness of the eigenvalues is determined by the scale of the perturbation 
matrix E, which is measured by its norm. And the robustness of the eigen- 
vectors is determined by the scale of the perturbation matrix E as well as the 
eigengap of the corresponding eigenvalue. 


Example 11 The effect of noise on the eigensystem 


Figure 2.12 shows how eigenvalues and eigenvectors of £ are affected 
when different amounts of noise are added to the data. Here £ is the nor- 
malized Laplacian matrix constructed in Example 7. Let L be the noise- 
contaminated Laplacian matrix, and a = laz „a reflects how much noise 
has been added to the data. In the figure, the x-axis corresponds to a. To 
generate the plots, we gradually increase a from 0 to 0.5. The y-axis of 
Figure 2.12(a) corresponds to the scale of the eigenvalues. The y-axis of 
Figure 2.12(b) corresponds to the scale of sin(0.), where ĝe is the angle 


between £, the original eigenvector, and £, the noise-perturbed eigenvector. 

Figure 2.11 plots the distribution of the samples when the original data 
is perturbed using a = 0.3. The chart shows that with noise perturbation, 
the cluster structures of the data become blurred. Figure 2.12(a) plots the 
values of the Az, A3, A10, and Ago under perturbations of different scales. 
The plot shows that when more noise is added, the leading eigenvalues of 
L become bigger and the gap between leading and tail eigenvalues becomes 
smaller. This corresponds to the fact that when more noise is added, the 
cluster structures of the perturbed data become blurred. Figure 2.12-(b) 
plots the values of sin(6.) when the scale of the perturbation varies. The 
plot shows that the leading eigenvectors are more robust to noise, since 
they have a bigger eigengap, which is consistent with the results presented 
in Theorem 2.6.2. 


Based on the two theorems, we provide an error upper-bound analysis for 
feature ranking functions p1(-), pa(-) and p3(-), when the original data are 
perturbed by noise. In general, noise can cause two types of perturbation that 
will affect the outputs of ranking functions. They are (1) the perturbation of 
the Laplacian matrix £, which is denoted as £,; and (2) the perturbation of 
the feature vector f, which is denoted as f.. Without loss of generality, for the 
original data and its perturbation, by assuming ez > 0, we have the following 
specifications: 


E= L+ Lea Elle = ([Lll2 = 1, [|Lelle < ez. (2.30) 
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FIGURE 2.11: The effect of noise on sample distribution. 


(a) Az, Az, A10, A30 (b) €2, 63, £10, €30 


0 0.05 0.1 0.15 02 025 03 0.35 0.4 045 0. 0.05 0.1 0.15 02 025 03 0.35 04 045 05 


FIGURE 2.12: The effect of noise on eigenvalues and eigenvectors. The x- 
axis is the a value, which reflects how much noise has been added to the 
data. 
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F= f+ t, Ela = Ela = 1, Ea < er. (2.31) 


In the above equations, £ is the original Laplacian matrix; £ is the per- 
turbed Laplacian matrix; and £e is the corresponding perturbation matrix. A 
similar relationship holds for f, f., and f, where f is a feature vector. With 
the following theorem, we show that we can bind the perturbation errors of 
the three feature ranking functions (1(-), pa(-), and pa(-) by ez, er, and the 
eigengap of the eigenvalues of Ê. 


Theorem 2.6.3 Assume (Aj, €) is the eigensystem of L, and aj = cos6;, 
where 0; is the angle between f and Ej. Also let Aj, Es) be the eigensystem 
of L, and Qj = cos 0; where 0; is the angle between f and Es. We have 


d d d d 
x 1 
> Aj -5 azAj| < (all£dl2 + > Aj sin G arcsin 2€;, ,) + Ep > Aj], 
j=l j=l j=l j=l 

(2.32) 


where €j, 9 = anit and êp = 2ef + ez. Note that when arcsin 2€;, y < 1, 


d d 
1 
q\|Lellg + ) Aj sin G arcsin 2e;, ) +6 > Aj 
j=l 


j=l 


q q 
~ | all Cella + Azez, o + ep Y) Aj |. (2.33) 
j=1 j=1 


Proof: We first provide an upper bound for & Àj using e, ef and the eigengap 


of L. For convenience, in the first part of the proof, we drop off the subscript 
j from a% 25, if it does not cause confusion. Let € = € + Es À = = A + Ae. We 
have: 


ax = cos? (0) = AAS (EHT E+E) 
EHT (e) AAA DT E) 

a5 rara) aA e (NE + Ell E + E? 
se AENA AE) +A 
A ( (ET (E+ ED)? + lll I+ El)? + 2 ela 1E + El Ia) +A 


II 


IA 


Since lll = |[él2 = 1 and [Ela = [Ella = 1, we have 


BALA (El Ext) + d+ 2er) RNa 


Univariate Formulations for Spectral Feature Selection 49 


In the above derivations, we applied the Cauchy-Schwarz inequality: 
x! y < ||x|||ly||. By noticing that £! (€ +£) < cos(0 — ĝe), where 0. is the 
angle between £ and £, we have 


(£! (£+ E.) < cos” (9 — 0.) — cos? (0) + cos? (0) 
= sin (29 — 6.) sin (6e) + cos? (0) 
| sin (6e) | + cos? (0). 


A 


Based on the two sets of results we just obtained, we have the following in- 
equality 


SA 


< Ae +A (EF +2) +A] sin (8e) | + Acos? (0), 
and N 
CAPAS Ae +A (Es + Bez) +A] sin (04) |. 
According to Theorem 2.6.2, we know A. < ||£.|| and |sin(0.)]| < 


f : Le 
sin (3 arcsin 2e;, 9), where €j, 9 = GA Nella 


GapGLtL£e)° We obtain the following result: 


q q 
5. adj -X2 < [allLell, Na sin (5 arcsin 2€;, o) HEN 
j=1 j=l 


j=l 
Also, by noticing that when 0 < 1, 3 sin 20 = sind = 0, we can obtain 
Equation (2.33). 

E 


In ĝi(-), i = 1,2,3, and we apply y(-) to rescale the eigenvalues of £ before 
calculating the feature scores. Based on Theorem 2.6.3, we can also provide 
a robustness analysis for ĝ;(-). In the analysis we assume y(-) is a rational 
function and has the form y(A) = A”. 


Theorem 2.6.4 Let (A;,€;) be the eigensystem of L, and aj = cos 05, 1 where 
0; is the angle between f and £j. Let (Ay, Es) be the eigensystem of L, and 
Qj = COS 0, where 0, is the angle between f and Es. Also let dy = Aj + ej 
and assume pr; > Aj, for all 1 <i <n. With y(A) = A, the jolla 
inequality holds: 


j=1 j=1 
q p 
+1 1 es y 1 š r a \r 
< y (Wed (p+1) A7 1 4sin G arcsin 2€;, ) A; + eas) „ (2.34) 
j=1 
Le : 
where €j, y = l£d and és = 2e¢ + e. Note, when arcsin éj, 9 < 1, 


GAP (j, L + Le) 
we have 
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q 


- 1)” — 1 1 
5 (ne. llo (pt) See +sin G arcsin 2e;, o) A; + eas) 


j=1 P 
q r 
enel are 
< Y (iek! llo - NOt + 6, HEA), (2.35) 
j=l 


Proof: This is true because of the following inequality 
aN" = (A+2z)" cos? (9) — Y cos? (9) +d" cos? (9) 

= ((A +A)” — A) cos? (9) + A" cos? (6) 

= A (AFTE Ac $F ATP + ATI) cos? (6) + A” cos? (8) 


o Ti (1+(1+p)+---+(14 pr) cos? (9) +X" cos? (5) 
= AA Cto! cos? (9) + A” cos? (9) 
< gyrate? + À” cos? (5) . 

p 


When the original data are perturbed by noise, feature scores will change 
accordingly. The two theorems show that before and after the perturbation, 
the difference of the feature scores is bound by ez, eg, and the eigengap of 
L. Among these factors, e, corresponds to the matrix perturbation, éf corre- 
sponds to the feature vector perturbation, and the eigengap of £ corresponds 
to matrix stability [38]. According to spectral clustering theory, if a data set 
has an easily separable cluster structure, the top eigengaps of its Laplacian 
matrix should be large. Since the cluster structure of the data is clear, a small 
amount of noise will not be able to alter the structure easily. Therefore, a 
Laplacian matrix with larger top eigengaps is considered to be more robust 
to noise. 

Based on the above theorems, we have the following points regarding the 
robustness of the feature ranking functions Și (-), Pa(-), and @3(-) 

a) Discarding the tail eigenpairs in feature evaluation helps increase ro- 
bustness. For a graph with well-separable cluster structures, its tail eigenvalues 
are usually packed in a small range, thus having small eigengaps [128]. Equa- 
tion (2.34) suggests including eigenpairs with small eigengaps can increase its 
sensitivity to noise. Also, it is known that for a graph, its tail eigenvectors usu- 
ally correspond to the subtle structures formed due to noise [188]. Therefore, 
removing them helps improve robustness. 
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b) Among the three feature ranking functions, p3(-) is more robust than 
p1(-) and pa(-), since it discards the least robust tail eigenpairs in evaluation. 
pi(-) and pa(-) are equally robust, since Ay and & are constants. Although 
p3(-) is more robust in theory, to perform well, it needs a proper threshold 
to determine which tail eigenpairs should be discarded. Discarding either too 
many or too few eigenpairs may cause loss of information or inclusion of noise. 
In [128] the authors proposed using a spectrum gap (the eigengap that divides 
eigenvalues into two well-separable groups) or the known number of clusters 
to determine the threshold. 

c) Theorem 2.6.4 suggests that if only leading eigenpairs are used for 
computing feature scores, then a high-order rational function can increase 
robustness, since the leading eigenvalues are usually smaller than one. But, 
if all eigenpairs are used, the robustness may decrease, as we usually have 
A > > Aj. However, in our experiments, we found that even when all 
eigenpairs are involved, a high-order rational function still helps improve per- 
formance. Two points may support our observation. First, > A; actually does 
not increase much compared with >> Aj. For example, in our experiments, we 
found >> Aj usually increases the value by no more than 25 percent. Second, 
with a high-order rational function, ¢;(-) penalizes the large eigenvalues in a 
harsher way, effectively increasing the gap between the scores of relevant and 
irrelevant features, which makes the algorithm more robust to perturbations. 
Therefore, in reality it is reasonable to use high-order rational functions to 
improve performance. 
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Example 12 The effect of noise on the feature ranking functions 


Figure 2.13 shows how feature ranking functions fi (+), G2 (-), and 63 (-) 
are affected by noise. In the figure, £ is the normalized Laplacian matrix 
constructed in Example 7. Let £ be the noise-contaminated Laplacian ma- 
trix, and a = lez I, where a reflects how much noise has been added to 
the data. In the figure, the z-axis corresponds to a. To generate the plots, 
we gradually increase a from 0 to 1. The y-axis corresponds to the feature 
scores of each feature. For ĝı(-) and f2(-), a smaller feature score indicates 
more relevance. While for ¢3(-), a bigger feature score indicates more rel- 
evance. Six features are tested: F1, Fo, F3, Fa, fs, and Fg. Among the six 
features, Fı and F> correspond to the first and the second dimension of the 
2D Gaussian mixture generated in Example 6, and are relevant features. 
F3,..., Fg are randomly generated with their values following the standard 
uniform distribution. Hence, these features are irrelevant. 

We obtain three observations. First, among the three feature ranking 
functions, y3(-) is the most robust to perturbations. When a = 0.8, both 
pi(-) and pa(-) mix up relevant features with irrelevant ones, while y3(-) 
can still separate them.® Second, comparing to y1(-), p2(-) is more robust 
due to the fact that the scores returned by p2(-) offer bigger gaps between 
the relevant and the irrelevant features. We also notice that the feature 
scores returned by pi(:) and p2(:) actually change in a similar trend as 
the scale of the perturbation varies. It can be verified that the different 
behavior of yi (-) and pa(-) is caused by the mechanism of removing the 
trivial eigenpair, (A1,€1) = (0,D21/||D21\|), from consideration. Recall 


za R Ă a 7! 
that pal )=Ê L£ f; ( — (f «) ) . It turns out that the relevant features 


are usually far from €,°, and therefore have denominators near 1. In con- 
trast, the irrelevant ones are often relatively closer to £1 and have smaller 
denominators. This difference effectively increases the score gaps between 
the relevant and irrelevant features. Third, a high-order rational function 
also helps increase the score gap. From the chart, we can observe that al- 
though pi (A) mixes up the relevant with irrelevant features when a = 0.53, 
with the help of y(-) = X5, ¢1(A) does not mix up the relevant with the 
irrelevant until a ~ 0.67. A similar effect can also be observed on the other 
two feature ranking functions, although these trends are less evident. 


6When a = 0.8, the scores generated from p3(-) for the six features are 0.0206, 0.0501, 
0.0096, 0.0154, 0.0023, and 0.0049, respectively. 
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yi(L) p2(£) 


FIGURE 2.13: (SEE COLOR INSERT) Effects of noise on the feature 
ranking functions. 
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2.7 Discussions 


SPEC is a unified spectral feature selection framework for supervised, un- 
supervised, and semi-supervised feature selection. It is based on three univari- 
ate formulations for feature evaluation and is of a filter model. We show that 
families of effective algorithms can be derived from the framework. We con- 
duct robustness analysis based on perturbation theory. The analysis enables 
us to obtain better understanding of the behavior of the SPEC in a noisy 
learning environment. 

The proposed framework consists of three components: the similarity ma- 
trix S, the ranking function ¢(-), and the spectral function y(-). A proper 
configuration of the framework ensures good performance. Based on our exper- 
imental results and observations, we offer the following guidelines for configur- 
ing SPEC: (1) The similarity matrix depicts the relationship among samples. 
A matrix which reflects the true relationships among samples is important for 
SPEC to select good features. (2) In noisy learning environments, either (P3(-) 
or a high-order rational function y(-) is helpful for removing noise. (3) For 
data with a clear spectrum gap, using £3(-) may be very effective. Otherwise 
Pa(-) could be more promising. Compared with ¢3(-), Pa(-) is less aggressive 
and usually provides robust performance. (4) SPEC generates feature weight- 
ing algorithms. However, most feature weighting algorithms do not consider 
feature redundancy, which may hurt learning performance [212]. To address 
the problem, we will propose a multivariate formulation for spectral feature 
selection in Chapter 3. 


Chapter 3 


Multivariate Formulations for Spectral 
Feature Selection 


Redundant features are those that are relevant to the target concept, but 
their removal has no negative effect. Usually, a feature becomes redundant 
when it can be expressed by other features. Redundant features unnecessarily 
increase data dimensionality [89], which worsens the learning performance. 
It has been empirically shown that removing redundant features can result 
in significant performance improvement [69, 40, 56, 210, 6, 43]. In the last 
section, we introduced the SPEC framework for spectral feature selection. We 
notice that the feature evaluation criteria in SPEC are univariate: features 
are evaluated individually, therefore the framework is not capable of handling 
redundant features. 


Example 13 An example of a redundant feature 


Assume we have three features F}, Fo, and F3. Among the three features, 
F can be expressed as a linear combination of Fi and F»: 


F3 = aL, + bF>, a,b € R. 


Given any function containing the three features, we can write it as a 
function that contains only Fi and F»: 


6 (Fi, Fa, F3) = 6 (Fi, Fo, aF, + bF2) = ¢' (Fi, Fo). 


Therefore, in this case, F} is redundant due to the existence of F; and F>. 


Spectral feature selection can handle redundant features by evaluating the 
utility of a set of features jointly. In this chapter, we study two multivariate 
formulations for spectral feature selection, one based on multi-output regres- 
sion [72] with an L2 j-norm regularization, and the other based on matrix 
comparison. We analyze their capabilities for detecting redundant features, 
and study their efficiency for problem solving. Before we present the two for- 
mulations, we first study an interesting characteristic of the SPEC framework, 
which we introduced in the last chapter. We show that SPEC selects features 
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by evaluating their capability of preserving the sample similarity specified by 
the given similarity matrix S. Based on this insight, we present two multivari- 
ate formulations for spectral feature selection. 


3.1 The Similarity Preserving Nature of SPEC 


As shown in Chapter 2, given a similarity matrix S, SPEC selects features 
aligning well with the top eigenvectors of £. Here £ is the normalized Lapla- 
cian matrix derived from S. This fact brings us to the conjecture that if we 
construct a new sample similarity matrix K, using the features selected by 
SPEC, K should be similar to S, in the sense that if the two samples are sim- 
ilar according to S, they should also be similar according to K. To precisely 
study the similarity preserving nature of SPEC, we reformulate the relevance 
evaluation criteria used in the SPEC framework in a more general form: 


max 5 T = max ( 5 f! S e) fe R”, SERA (3.1) 


FEF sup FEF suv 


Basically, we want to find a set of selected features, Fsub, such that the 
objective specified in Equation (3.1) can be maximized. In the above equation, 
f and S are the normalized feature vector and the normalized sample similarity 
matrix derived from f and S, respectively. It is shown in [155] that solving the 
following problem 


maxk»0 Trace (KS) st. Trace (kK) < 1, (3.2) 


will result in a kernel matrix K, which preserves the sample similarity specified 
in S. Here, the constraint K > 0 requires the matrix K to be positive semi- 
definite. We can write Equation (3.1) in the form of Equation (3.2), 


max) ner, Ê Sf = max ) pop ES Trace 6 fs) 


Fsub 


= max Trace (E a f iT) 8) ; 


sub 


We also have > pe E E fl = Xp, Xr.: Here Xp,,,, is the data con- 
taining only the features i in PL hie we have the following equation: 


pete = i 
max), y Î Sf = max Trace (E... XP) 8) ' (3.3) 


This equation shows that max PE... f'Sf will select a set of features 
sub 
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Fsub, such that the linear kernel constructed from Xp,,,, 
pairwise sample similarity specified in S. In other words, we can say that the 
features in Fsup have a strong capability of preserving the pairwise sample 
similarity specified in S. We can also show this in a more intuitive » Way: since 
f' Sf = S >, ŝu fifi, assuming that features are normalized (|| Ê || = 1), to 
obtain a large value from Equation (3.1), a feature must assign similar values 


can preserve the 


to the samples that are similar according to S. This ensures that the feature 
has the strong capability of preserving the sample similarity specified in S. 


Example 14 Measuring consistency between matrices 


Trace (KS) can be used to measure the consistency between matrices. To 
show this, we generated a two-dimensional data set with three classes, 
whose distribution is shown in Figure 3.1(a). We then generate noise- 
contaminated data sets by adding different levels of noise to the data set. 
Figure 3.1(b), (c), and (d) correspond to the data sets containing 30%, 60%, 
and 90% of noise, respectively. We construct linear kernels on both the origi- 
nal data set and the noise-contaminated data sets, and compute Trace (KS) 
to measure the consistency between matrices. Here S is the linear kernel 
constructed on the original data set, and K is the linear kernel constructed 
on either the original data set or a noise-contaminated data set. From the 
figures the similarity relationships among samples are perturbed propor- 
tionally to the level of the noise added to the data. And correspondingly, 
the value of Trace (KS) decreases. When K = S, Trace (KS) = 1.282, while 
when K is constructed on the data set containing 90% of noise, Trace (KS) 
decreases to 0.7994. 


In the following, we show how SPEC can be reformulated in the form of 
Equation (3.1). We first study a simple case, in which the spectral matrix 
function, y (-), is not applied. Using the following theorem, we show that with 
different definitions of f and $, the three feature ranking functions ¿y (-), 
pa (-), and p3 (-) can be written in a common form: max Ps fT SÊ. 


Here, f and S are the normalized versions of f and S. 
Theorem 3.1.1 Let S be a similarity matrix, and D and £ be its degree and 


normalized Laplacian matrices, respectively. SPEC selects k features, which 
maximize the following objective function: 


arg MAX F,, ,...,Fi, 


(3.4) 


iM- 
h> 
— 
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FIGURE 3.1: Measure consistency between matrices via Trace (KS). 
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When pi (-) is applied, f and $ are defined as 
E Def X 
f i S=D-:sD-:. (3.5) 
Dé" 
When Pa (-) is applied, let €, be the first eigenvector of L, which is equal 


o Zi in f and § are defined as 
ID? 1l 


When (3 (-) is applied, let L = UXU! , where U = (&,...,€n) and Y = 
diag[(A1,..., Àn) are the eigen decomposition of L. f and S are defined as: 


1 
îi aa re 
¡Die 
where Up = (€2,...,€x), Xx = diag(A2,...,Ax)- 


= U; (21- YE,) U}, (3.7) 


Proof: We start from pi (-). It is easy to verify that pa (F) = f° (1 — 5) f. In 


SPEC, features are evaluated independently; therefore, using y (-) to select k 
features can be achieved by picking the top k features that have the smallest 
pa (-) values. This process can be formulated as the following optimization 
problem: 


k 
argming,, pu, YO ft (1- 8) Ex 
j=l 
Note that in the above equation, features are evaluated independently. Addi- 
tionally, 


k 
arg minr,, „Fi, 5 ÊT (1-8) Ê; = arg MaXp,,,..., Fi, 5 E S Ea 


j=1 j= 


= 


Here we assume the features have been normalized, therefore all have unit 
norm. 

In the case of (a (-), it is easy to verify that ||f|| = 1. Since the first eigen- 
value of the normalized Laplacian matrix £ is always zero, we have | £x=0 
for any x € R”*!. Based on these two facts, we can verify that Equation (3.6) 
holds. 

Similarly, in the case of (3 (-), the following equation holds: 


k 
FTU; (21 — Er) Unf => (2-Aj)aj = p3(F). 


This proves the equivalence when 93 (-) is used. E 
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Theorem 3.1.1 shows that when pi (-) or p2(:) is used, SPEC tries to 
preserve the sample similarity specified by D-:SD-2 „which is the normalized 
sample similarity matrix. When 3 (-) is used, SPEC tries to preserve the 
sample similarity specified by Uz (21 — £) U}, which is derived from £ by 
adjusting the leading eigenvalues and discarding the tail eigenpairs. In y (-) 
and pa (-), the features are first reweighted by D2f, which forms the density 
reweighted features [74]. And they are then normalized to have the unit norm. 
This step emphasizes the elements in a feature vector, which correspond to the 
samples from a neighborhood with dense sample distribution. In (pa (-), there 
is an additional orthogonalization step: features are made to be orthogonal 
to €;, and then normalized to have unit norm. This step removes £ from 
consideration. As we mentioned, by aligning closely with €,, a feature can 
achieve a large pi (-) value. However, €, only captures the density information 
of the data. The orthogonalization step ensures that we will not assign high 
relevance scores to features that align closely with £;. 

Similarly, when the spectral matrix function  (-) is used in SPEC, using 
the following theorem, we show that the three feature ranking functions fa (-), 
2 (-), and ¢3 (-) can also be formulated into the form: max IFEF FT SÊ. 

sub 
Theorem 3.1.2 Let S be a similarity matrix, and D and £ be its degree and 
normalized Laplacian matrices, respectively. Also let y (-) be a spectral matrix 
function. SPEC selects k features that maximize the objective function: 


k 
arg maxr, ,..., Fs, D ÊT Ss Ei (3.8) 


3 
j=1 


Let L = USU!, where U = (€,,...,€n) and E = diag(\1,.-.,An)- 
When ĝı (-) is applied, f and $ are defined by 


A D>f A 
f =D Da ŝ=U(I-7(®)) UT. (3.9) 
2 
Let €, be the first eigenvector of L. When (a (-) is applied, f and $ are 
defined as 
TE e E a 
Da A 


When 3 (-) is applied, f and $ are defined as 


S=U(I-y(2)U”. (3.10) 


o 8 
Dèt 
U; = (£2,..., Ek); E, = diagl(A2,..., Ax). 


= U; (y (21) — y(Ze)) Uf, (3.11) 
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Theorem 3.1.2 can be proved in a way similar to Theorem 3.1.1. There- 
fore we omit its proof. Theorems 3.1.1 and 3.1.2 together demonstrate the 
similarity preserving nature of SPEC. 

In SPEC, features are evaluated independently. A direct consequence of 
this is that redundant features cannot be properly handled by the SPEC 
framework. Redundant features unnecessarily increase dimensionality and can 
worsen learning performance. They need to be removed in the feature selection 
process. To achieve this, we propose in the following sections two multivari- 
ate formulations for spectral feature selection, which are able to evaluate the 
utility of a set of features jointly. We show that the multivariate formulations 
for spectral feature selection can identify redundant features effectively. 


3.2 A Sparse Multi-Output Regression Formulation 


Let X € R"*™ be a data matrix, where n and m are the numbers of 
samples and features, respectively. Given a sample similarity matrix S and a 
feature F, different feature evaluation criteria in SPEC can be formulated in 
a common form as 


oF, s) = iTS P=) °K FE =D (FEOS 12) 


where f is the normalized feature vector, S is the normalized similarity ma- 
trix, and Îi and & are the i-th eigenvalue and eigenvector of S, respectively. 
Equation (3.12) shows that these criteria evaluate features individually, and 
are therefore unable to identify redundant features. 


La 
Let y; = A? &, Equation (3.12) can be written as 


(F, S) = $ (A787 J = y (Py) (3.13) 


1= g 


To identify redundant features, the features must be evaluated jointly. Instead 
of looking for features that are close to y;, as formulated in the above equation, 
we should find a set of l features, such that their linear span [77] is close to 
y;. This idea can be formulated as 


arg min ly: — Xawi,all3, 
Wi, A 


where A = {i1,...,a} C {1,...,m}, Xa =(f;,,...,f;,), and w; € R!*+. Note 
that in the above equation, we apply the Lə norm on the difference of two 


AA 
vectors to measure their closeness. When all y; = A; é; are considered, their 
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joint optimization can be formulated as 


n 
arg min Y ly; — Xawiall2 = arg min [Y — XA Wal? (3.14) 
A,wi,a i=l A,wa 
In the above equation, Y =(y1,...,Yn), Wa=(wi,a,---,;Wn,a). Assuming 


that S = UXU" is the SVD of $, we have Y=UD!/2, And ||- ||r is the 
Frobenius norm [59], which is defined as 


IRI = 4/ Trace (RT R). (3.15) 


We noticed that when A contains only one feature, the formulation reduces 
to searching for features that maximize the Equation (3.12). 

Given Y and XA, the optimal Wa in Equation (3.14) can be obtained in 
a closed form. However, feature selection needs to find the optimal A, which 
is a combinatorial problem being NP-hard. To make the problem solvable, we 
can approximate the problem as [9, 44, 95, 126, 132, 214, 218, 116, 115] 


x 2 
arg min [Y — XW | +A [PWI 1 
st. A={i:||w'||, > 0}, Card (A) =], (3.16) 


where Card (A) returns the cardinality of the set A, wê denotes the i-th row 
of W, and ||W]|, , is the L2, ¡-norm, defined as 


WI = Doll, (3.17) 
¿=1 


When applied in regression, the Lo ¡-norm regularization is equivalent to 
applying a Laplace prior [153] on wt, which tends to force many rows in W 
to become zero row vectors and results in a sparse solution. The effect of the 
Lo 1-norm constraint can be demonstrated by the following example. 


Example 15 The effect of the La 1 -norm regularization 


In this example, we randomly construct a data set containing 200 sam- 
E ia aa 
ples and 200 features, X € IR200x200. and (17.4) sea (Afoo: £100) are 


used to form Y, Y € R200*10%. We try different A values for the regular- 
ization and plot the obtained W in Figure 3.2. In the figure, a light spot 
corresponds to a nonzero element in W. It shows, when A = 0 (no regular- 
ization is applied), all features are selected. Since a larger ||W ||, , applies a 
more severe penalty on a dense solution, when we increase A from 0 to 500, 
more and more rows of W are set to 0. Because each row of W corresponds 
to a feature in X, by setting many rows of W to zero, we remove many 
features from consideration, which helps us achieve feature selection. 
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There are three advantages of the formulation presented in Equa- 
tion (3.16): 


(1) It can find a set of features jointly preserving the sample similarity 
specified by S. 


(2) By jointly evaluating a set of features, it tends to select nonredundant 
features. 


(3) The problem specified in the formulation is tractable. 


First, by the following theorem, we show that the formulation can find a 
set of features jointly preserving the sample similarity specified by S. 


Theorem 3.2.1 Let S = UXU!, Y = UD!”? and Q = Y — XW. We have 
|xww TXT — Sl], < 2 Vip + Oz) lOllp- 


Proof: Since S = YY' and ||A + Blir < ||Al|7 + Bir, we have 


|IKWWIX'-yy'|p = |ay'+yvo0'+00"|p 
OY" lle + Yo [7 + OQ" le 
2 Y lle Ol + NO 

= QllYle + (107) 19] 7. 


IN IA 


In the derivation, we use the inequality ||ABlr < [Al [Bl 7. 
a 


In the above theorem, XW is a new representation of samples obtained 
by linearly combining the selected features. And XWW'X' computes the 
pairwise similarity among samples measured by their inner product under 
this new representation. The theorem shows that by minimizing ||Q||7, we 
also minimize |XWW!X! — S||r, which ensures the selected features can 
jointly preserve the sample similarity specified by S. 

Second, the formulation tends to select nonredundant features. Assume 
two features f, and f, satisfy the following conditions: (1) they are equally 
correlated to Y, i.e., f Y = EY (2) f4 is highly correlated to fa, i.e., £ fa > 
1, and f, is less correlated to fa, i.e., Aj fa > A fa (without loss of generality, 
we assume both fp and f} are positively correlated to fy); (3) they are equally 
correlated to other features, i.e., f f; = f} fi, Vi € {1,...,m}, i Æ d. Based 
on these assumptions, we attain the following theorem. 


Theorem 3.2.2 Given the above assumptions, assuming fa is selected by an 
optimal solution of Equation (3.16), then fa has higher priority than fp to be 
selected in the optimal solution. 


1Note that although W € R™**, many of its rows are 0! . Therefore, the representation 
is generated by using only a small subset of selected features. 


64 Spectral Feature Selection for Data Mining 
A = 0, | = 200 A = 125, 1 = 38 


\ = 250, 1 =15 A = 500, 1=4 


FIGURE 3.2: Different A values for Lə ¡-norm regularization and their corre- 
sponding sparse solutions. l is the number of rows that have nonzero elements. 
As A becomes larger, more rows of S are set to 0. Note: each row of W cor- 
responds to a feature. 
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Proof: Let Y = (y1,::* ,yx) be the n x k target matrix, and W be the 
m x k weight matrix. The i-th row and j-th column of W are denoted 
by W, and W.,, respectively. Recall that || - ||r is the Frobenius norm 
and |W] = >, ||Wi:|l2 is the L21 norm. Let W be the current so- 
lution in which two strongly correlated features fa and fp are selected, i.e., 
|Wa.llr > 0, ||Wp:||r > 0. Using the technique developed in [144], we can 
show that in the optimal solution of Equation (3.16) (Wa:, W,.) > 0 when 
f] fp > 1. Assuming the three conditions specified above hold, we show that as 
long as f, has a sufficiently small correlation with fy, selecting f, rather than fp 
always decreases the objective function. To this end, we define another weight 
matrix W as: W;. = W, for V i Æ p,q, W, = W.:, and W>: = 0. Note that 
(1) IIWll2 1 = ['Wll2 1; and (2) W and W have no difference except the p-th 
and q-th rows. Since ||Y — XW||} = E lly; — XW.,||3, we can show that 


IY — Wiz — [Y — XW] =2W,, YT (E, —£,) + Was, Wp) (pap — Pag), 


where, pi; = f; £j. In the derivation, we rely on the fact that 


l l 
S Woy; = Wo. Y", X WaW = (Wa, Wp). 
j=1 j=1 


Based on the equation, we reach the inequality 
IY - XW]|} — |Y - XW] > 0 & 
1 
(Pap ai Pag) > (Wa, W,.)) W,:Y' (fp 7 f,) - 


[Y £p — Y'f,||2 = 0, according to the assumption. As far as Pag < Pap, 
selecting fq rather than fp will always decrease the objective function. 


The theorem shows that the formulation in Equation (3.16) tends to select 
features that are the least correlated, ensuring the selection of non-redundant 
features. 

Third, the problem specified in Equation (3.16) is tractable. Given a value 
for A, we obtain 


arg min [Y — XWllp +A Wa - (3.18) 


This problem can be solved by applying a general solver [133, 8, 115]. And 
given l, the number of features to be selected, its corresponding A value can 
be found by applying either a grid search or a binary search. Usually, a larger 
A value results in selecting fewer features. 

Below, we present three efficient solvers for the multivariate spectral fea- 
ture selection problem defined in Equation (3.16). The first two are based 
on solving the Lə ¡-regularized regression problem defined in Equation (3.18). 
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However, for a given l, using the two approaches, we still need to try many 
different A values to determine the proper regularization parameter, which re- 
sults in the selection of exact l features. This requires us to run a solver many 
times, which is computationally inefficient. The third solver is a path-following 
approach [205, 73, 48] based on solving the multivariate spectral feature selec- 
tion problem defined in Equation (3.16) directly. Therefore, the solver is more 
efficient comparing to the first two. Note, in Equation (3.18), we assume A 
is given, while in Equation (3.16) we do not assume this. Finding the proper 
regularization parameter A is part of the objective defined in Equation (3.16). 


3.3 Solving the L>¡-Regularized Regression Problem 


Assuming the regularization parameter A in Equation (3.18) is known, the 
two most efficient solvers for the Lə ¡-regularized regression problem specified 
in Equation (3.18) are the coordinate gradient descent method [185, 29, 186] 
and the accelerated gradient descent method [127, 12, 115, 116], and they are 
both iterative methods. The difference between the two methods is that in 
each iteration, the coordinate gradient descent method picks one row of the 
weight matrix W to optimize, while the accelerated gradient method optimizes 
the whole weight matrix. Although the two methods use different strategies 
for optimization, they are based on solving the same model function, which is 
defined as 


1 
Mw, (W) = 3 IX WS — Y Il + Al Who. 


+ Trace (xw; - Y) X(W; — w)) 


dp “i Trace ((w, —w)' (Ww; — w)) l 


In the above function, W; and A are given, and W is unknown. 
Mw, (W) forms a quadratic approximation of the original objective func- 
tion defined in Equation (3.18). 

Let 


1 
loss (W) = 5 [Y = XWI'2 AY la. 
We can verify that 


| XW — Y [7 


= T E 
AW =2X" (XW - Y). 
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Therefore, Mw,,, (W) can be rewritten as 
Mw, (W) = loss(W;) + Trace (loss! (w;)' (W — W;)) + Al|W]lo4 
L; 
+ Trace ((w, -W)' (Ww; — w)) : (3.19) 


The above equation shows that the first two terms of the model function 
Mw, (W) form the first-order Taylor expansion of |Y — XW||; at the 
point W). [Y — XW]? is differentiable anywhere in the domain of W, and 
|W'||2, is not differentiable at the points w' = 0, ¿=1,...,m. Therefore, in 
loss (W), [PY — SAA forms the smooth part, and A ||W ||, forms the non- 
smooth part. We put the nonsmooth part of the loss function directly into 
the model function. The regularization term Trace ((w, —w)' (W; — w)) 
puts a constraint on the distance between W; and W, and prevents W from 
being too far from W,. This term ensures that the model function can be a 
good approximation to the original loss function in the neighborhood of Wj. 

Let w’ be the i-th row of W. We can also write Equation (3.19) as 


1 = 
Mw,,,(W) = 5IXW;- Yl +) <£ (Y —- XW)),w) - w > 
i=l 
L; ~ i i i i 
+ AWllo. + coe < wi —wi,wi — wi >, (3.20) 
i=l 


where < - > is the inner product operator on vectors, < x,y >= x! y. It is 
easy to see that when W* is the optimal solution of Eq. (3.18), we have 


w* = arg min Mw, (W) (3.21) 


Both the coordinate gradient descent method and the accelerated gradient 
method try to generate a sequence of W; to approach the optimal solution 
of Equation (3.18). Assuming W; is the current solution, the next solution in 
the sequence is obtained by solving the following problems, 


arg min Mw, (W), Accelerated gradient descent method; (3.22) 
arg min Mw, (W), Coordinate gradient method. (3.23) 


We will study how the sequence is generated in detail later. 
It turns out that Equation (3.22) and Equation (3.23) have closed form 
solutions. Let’s assume 


1 
V=W;- id (XW;- Y), p= (3.24) 
J 


A 
i 
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Then we can reformulate arg min Mw, 1 (W) in the following form: 


: paik 
arg min Mw, a (W) = arg min 5 |W = VI + pW; (3-25) 


This equation can be simply verified by plugging V and p on the right side of 
the equation. Since || A||} = Trace (A! A) = $, a", where a’ is the i-th row 
of A, we have 


. ; tdo la i 
arg min Mw, (W) = arg min (E 3 lw -= v'||5 + pllw i) . 


i=l 


Therefore, the minimization of wt is independent of the other rows of W, and 
the original problem can be decomposed into m independent sub-problems: 


1. 
arg min PUMA —v' [3+ pl], ¿=1,...,m. (3.26) 


To obtain the minimizer of the model function Mw, 1 (W), we notice that 


wi — vil is differentiable in the domain of wi, and its gradient is given by 


oi — vile 
Since Iwll , is not differentiable at 0, it has only subgradient [18, 131], which 
is given by i 

oliw llaa _ w when w’ 40 
ui) lwl : 


u € R!** Jul <1, when w=0 


(3.28) 


Since Mw, 1 (W) is convex, according to convex optimization theory [18], we 
know it has a unique minimal, and its optimal solution can be obtained by 
solving the problem 

Mw, (W) 

Ow? 

By plugging the following equation back into Equation (3.29), we can verify 
that it is the optimal solution of Equation (3.29). Therefore, it is also the 
optimal solution of Equation (3.26). 


=0. (3.29) 


z 
II 


i vi|1— ——), when |lv'|lo > 
(= pa) ll II 0 
0, when ||v'|lo <p 
Below we present the coordinate gradient descent method and the accel- 


erated gradient method for solving the La ¡-regularized regression problem 
defined in Equation (3.18). 
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3.3.1 The Coordinate Gradient Descent Method (CGD) 


In the coordinate gradient descent (CGD) method, we generate the so- 
lution sequence by iteratively updating the rows of W. This boils down to 
solving the sub-problems of Equation (3.25), which is defined as: 


gael afl od Me 
arg min wi — vi][3 + plwélla, ¿=1,...,m. 
wi 2 


The pseudo-code of the coordinate gradient descent method is shown in 
Algorithm 2. The method contains two major steps: (1) Line 2-Line 15, update 
each row of the weight matrix W, wi; and (2) Line 16-Line 21, test whether 
the solution converges.” 


Algorithm 2: The coordinate gradient descent method 
Input: X, Y, A, Wo, Lp > 0 
Output: W 


1 W=Wo, j=1; 
2 fori=1...m do 
3 for L = Lj 19 2L; 13 AL; 13 ... do 
4 vi = wi — tf! (XW-Y), p=A/L; 
5 w = argmin 5 ov — vi [2 + pllw|: 
6 if [Y — XW ariel + Wui=w!l2 1 € Mw, a (Wwvi=w) then 
7 | break; 
d = w’ — w; 
if d 40 then 
10 a + line search; 
11 | wê = w’ + ad; 
12 =3+1, Lj = L; 
13 if converge then 
14 return W; 
15 else 
16 | goto line 2 


The first step of the algorithm can be further divided into two sub-steps: 
(a) In Line 2-Line 9, we compute the w for updating wê based on the 
current solution. To ensure the validity of the w obtained in Line 5, L in 
Equation (3.19) should be large enough so that Mw, , (Wwi=w) can bound 
[Y — XW yiewll% +A Will , from above. (b) In Line 10-Line 14, in- 
stead of replacing wê with w, we adopt a more sophisticated way for updating 
wi: we first compute the update direction d = wê — w (Line 12), then use line 


2The convergence of the solution can be measured by the norm of difference between 
W; and Wi+1, e.g., || Waa = Wille < E. 
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search to determine how far we should go with this direction, and update wi 
using wi = wi + ad (Line 13). The reason for doing this is that using w to 
replace wê can be too aggressive, which may interfere with the convergence. 
The pseudo-code of the line search procedure is shown in Algorithm 3. 


Algorithm 3: Line search 
Input: X, Y, A, W, i, de R'**, ag > 0, 5>0, 0 >0 
Output: a 

1 A=d" (Y— XW") £ + Alw + dla + la — Alw“ lla; 


2 for a = ao, da, 5209, ... do 
3 if f (Wyi-witaa) — f (W) < acA then 
4 Ñ break; 


5 return a; 


Algorithm 3 presents an inexact line search procedure using the Armijo 
rule [131]. In the procedure, ao > 0, 9 > 0, o > 0 are input parameters, and 
A is the improvement of model function by using w = wê + d as the new 
solution. 

It is shown in [186] that Algorithm 2 has a local linear convergence rate 
under certain conditions. Although the algorithm is reported to be very ef- 
ficient [213] in real-world applications, its global rate of convergence is still 
unknown. 


3.3.2 The Accelerated Gradient Descent Method (AGD) 


In the accelerated gradient descent (AGD) method, we generate a sequence 
of solutions by iteratively updating the W. The pseudo-code of the accelerate 
gradient descent method is shown in Algorithm 4. The method is based on 
generating two sequences: (W;);_, 2... (Lines 4-9) and (S;);-¡ 2... (Line 3). 
{W; ha be is the sequence of approximate solutions, which asymptotically 
approaches the optimal solution of Equation (3.18). And {S, ión is the 
sequence of the search points. A search point S; is the affine combination 
of W; and W;_1 And W; is computed by minimizing the model function 
Ms,_,,, (W). In Line 6, we check whether the L in Equation (3.19) is large 
enough, so that Ms,, > (W;+1) can bound [Y — XW all +A Wi+llo , 
from above. 

As shown in [116] the accelerated gradient method has a global convergence 
rate of O(442), where M is the number of iterations. Let m be the number of 
features, n the number of samples, and C the number of columns of Y. The 
time complexity of the accelerated gradient method is given by 


O (mnO Mlz), (3.31) 


where M is the maximal number of iterations specified in Algorithm 4, and 
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Algorithm 4: Accelerated gradient descent method 
Input: X, Y, A, Wo, Lo > 0, M 
Output: Wyr+1 

1 Wi; = Wo, A—1 0, Qo T; L Lo; 

2 for j = 1...M do 

3 | B;= 2, Sy = W + By (Wy — Wj); 

4 for L = L; 1, 2L; l; AL; 13 ... do 

5 Wu = arg min Ms, , (W); 


6 if PY — XI Wa 1% +A [Wi+1ll, < Ms, (W541) then 
7 L break; 


Ir /1+402 | 


8 | Li=L, Qj+1 = 2 ; 


9 return Wyr+1; 


lz is the averaged number of ties for searching the proper L in the validation 
process. The above equation shows that the time complexity of the accelerated 
gradient method is linear in terms of the number of samples and the number 
of features. Therefore, it is very efficient. 


3.4 Efficient Multivariate Spectral Feature Selection 


The two methods presented in the last section solve the Lə ¡-regularized 
regression problem specified in Equation (3.18), which has the form 


arg min |Y — XW ||; +A WI, - 


It may be inefficient for us to use them to solve the multivariate spectral 
feature selection problem specified in Equation (3.16), which has the form 


: 2 
arg min [Y — X WII; +A WI, 4 
st. A= {i:||w'||, > 0}, Card (A) =1. 


In the above formulation, A is not given. Instead, we are given l, the num- 
ber of features to be selected. One can solve this problem indirectly by solving 
Equation (3.18). However, this will not be efficient, since it requires us to solve 
Equation (3.18) multiple times for searching a proper A value, which leads to 
the selection of exact | features. Here we present an efficient path-following 
solver for the problem specified in Equation (3.16). It can automatically detect 
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the points when a new feature enters its active set, and update its param- 
eters accordingly. It can efficiently generate a solution path for selecting the 
specified number of features. 

We start by deriving the necessary and sufficient conditions for a feature to 
have nonzero weight (i.e., being selected) in an optimal solution of Equation 
(3.18). 

Let 

Loss (W, A) = [Y — XWIlz, — A||Wllo,.- 


We notice that Loss (W, A) is convex, but it is nonsmooth when w! = 0. Ac- 
cording to the convex optimization theorem [18], W. minimizes Loss (W, A) 
if and only if: 


0 € Ou: Loss (W,A)lw=w,, ¿=1,...,m. 


Here, Oy: (W, A) is the subdifferential of Loss (W,A) on wt, and has the 
following form: 


Owi Loss (W,A) = £;! (Y — XW) + Av; 


w? 
[wėl 
v; € (ulu € RX, Jul]? < 1}, if w'=0. (3.32) 


if wi 40 


Therefore, W, is an optimal solution if and only if: 
—dv; = f; (Y — XW) |wew,, Vi € {1,...,m}. (3.33) 


Based on this observation, we give necessary conditions (weak), and both 
necessary and sufficient conditions (strong) for W to be optimal. We show that 
a suboptimal solution, which satisfies the necessary conditions, can be easily 
obtained. And the obtained suboptimal solution can be efficiently adjusted 
to generate an optimal solution, which satisfies the necessary and sufficient 
conditions. 


Theorem 3.4.1 Assuming wi is the i-th row of W, the necessary conditions 
for W to be optimal are: Vi € {1,...,m}, 


w 40 = ET (Y—-XW)|l2=A 
wW=0 = FT (Y -XW)llo <A. (3.34) 


The theorem suggests that for a feature to be selected, its correlation to 
the residual, R = Y — XW, must be equal to A. Here, the correlation between 
a feature and the residual is measured by ||f!R.||2. A feature is selected if it 
has a nonzero weight vector in the optimal solution. The above property is also 
called the equal correlation condition for these features, since they are equally 


3An active set contains the indices of the features that have nonzero weights in W. 


Multivariate Formulations 73 


correlated to the residual. A solution satisfying this condition can be easily 
obtained via applying a forward stepwise search strategy, which is similar to 
that introduced in the least angle regression (LARS) [48]. 


Theorem 3.4.2 Assuming w' is the i-th row of W, the necessary and suf- 
ficient conditions for W to be optimal are: Vi € {1,...,m}, 


w40 > f (Y-XW)= 


I w*l|2 


wW=0 => If (Y -XW)ll2 <A. (3.35) 


Algorithm 5: MRSF: Minimal redundancy spectral feature selection 
Input: X, Y, / 
Output: W 
Wl = 0, A, =+00,¿=1 and R!!! = Y; 
Compute the initial active set: Ay = arg max; |E RO 2: 
while i < | do 
Compute the walking direction ya;: Ya, = (E X4, RU; 
for each j ¢ A; and an arbitrary t € A; do 
Compute the step size a; in direction ya, for fj to enter Aj. 
| IET (RE — aX 2,704) llo = (1 — 0) If RE; 
7 j* = arg minjga, Qj; 
A , T T 
s | We (H tana) 0)” 


9 | A=A;U{i*}, A = 1- aje RU»; 
10 Solve the smaller optimization problem, 


aon A O N = 


ming, [Y — Xa W [2 + A;||W]ls,1, using a general solver with W as 
the starting point; 

11 R=Y- X,W; 

12 | if Vid A, |E Ro < A; then 


13 Ajai = Á, wit = W, Rit = R, 1=1%+ dl 
14 else 
15 Â = {i [8°] #0} U [y + IE Bille > ri}; 
16 Remove w’ from W, if ||| = 0, 
Ă a T 
W = (W7.0,....0) „ Goto line 11; 


= 


7 Extend WU to W by adding empty rows to Wl); 
s return WI: 


= 


Based on the above two theorems, we propose an efficient solver for mul- 
tivariate spectral feature selection based on Lə -regularized regression. Its 
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pseudo-code can be found in Algorithm 5. In the algorithm, A, is the “active 
set” of the i-th run, and contains the features selected in that run. Algorithm 5 
contains two major steps. 

1. In Lines 4-8, the algorithm determines ya,, the direction for updating 
Wl (Line 4), and aj», the step size for updating W! (Lines 5-8). It then 
updates the active set and computes the A; (Line 10). We can verify that 
when the regularization parameter is set to A;, the Ww generated in this step 
is a suboptimal solution satisfying the equal correlation condition specified in 
Theorem 3.4.1. 

2. In Lines 11-18, the algorithm finds an optimal solution corresponding 
to the A; obtained in step 1. Given A;, it first solves an L2 ¡-norm regularized 
regression problem using a general solver (e.g., either the coordinate gradient 
descent method or the accelerated gradient method) (Line 11). Note that this 
problem is of much smaller scale, since it is based only on the features in the 
current active set, but not the whole set. For example, in the i-th iteration, 
there are only i features in the active set, and i < m. Also W is used as 
a starting point to accelerate the convergence of the solver. It then checks 
whether the obtained solution is also optimal on the whole data (Line 13) by 
using the conditions specified in Theorem 3.4.2. If it is true, the algorithm 
records the current optimal solution and proceeds to Line 4 for the next run 
(Line 14). Otherwise, it adjusts the active set, updates the W to remove 
the unselected features, and makes space to accommodate the newly selected 
features. It then returns to Line 11 (Line 17). In this step, we adjust the sub- 
optimal solution obtained in the first step and compute the optimal solution 
corresponding to A;. 


Theorem 3.4.3 (1) Let Wlil be the optimal solution obtained in the i-th 
iteration. The W generated in step 1 (Line 9) satisfies the equal correlation 
condition specified in Theorem 3.4.1. (2) The Wli+!l computed in step 2 (Line 
14) is an optimal solution corresponding to Ai. 


Proof: To prove the first part of the theorem, let WI be the optimal 
solution obtained in the i-th step. Its corresponding residual and regular- 
ization parameters are RU = Y — Xa, WË and A;, respectively. When 


A : T A 
W = (wi + ajya) .0) and A = A; U{j*}, the corresponding residue 
can be written as 


i T a 
= RË —aj-Xa,7A,. 
Therefore, we have 


f; R = f RË — asf Xa ya, Yt E AU). 
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When t € A,, we have 
f RË art Xa ya, = (1 aj- )f RY. 


To obtain this equation, we need to show 


ET Xaa: = f Xa, (XL, Xa.) XA RÚ = f RM, vte Aj. 
Let XA, = UV" be the SVD of Xa. Since f; is a column of XA,, we have 
fı, = Ua, a=YNv', 


where vi is the q-th row of V, and f; is the q-th column of Xa,. Based on 
this equation, we can obtain the following equation, 


fT (Xs, (Xa XA) Xi.) = £ UEV (VE2V7) VEUT 


M = Xa, (XA Xa.) XA, € ¡R”*” defines a projection matrix, which 
projects any n-dimensional vector to the space spanned by the columns of 
X. Since f; is a column of Xa,, it is already in the space spanned by the 
columns of X. The above equation shows that in this case, the projection 
matrix M will project f, into itself. Based on this observation, for Vt € A, we 
can obtain the equation 


I&R = Ut (Y -Xa (WP + a5-74,)) ll 
= (ap) fT RE. 


Since ||f) R||2 = |f] Rl|2 for Vp, q € A; (due to the optimality of wii), we 
have 


If ROI = I£ RI], Vp,q € As. (3.36) 


When t = j*, ff will be the feature that is about to enter the active set. 
According to Line 6 of the algorithm, vt € A;, we have 


IR = If} (RU ay-Xa 75) le 
= (1-0y») IE RU 


By combining Equation (3.36) and Equation (3.37), we proved the first part 
of the theorem. 

The second part of the theorem can be simply verified by applying the 
necessary and sufficient conditions for an optimal solution specified in Theo- 
rem 3.4.2. 
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To select | features, Algorithm 5 runs at most l — 1 iterations to shrink the 
regularization parameter À from +00 to a proper value, so that exact l features 
can be selected. In each iteration, the algorithm decreases A by a certain 
amount to allow a new feature to enter the active set. Let Wl"! and A; be the 
optimal solution and the corresponding regularization parameter obtained in 
the i-th iteration. According to Theorem 3.4.1, we know that all the active 
features in W!’ are equally correlated to the residual: ||f! (Y — XW") ll = 
Ai. And all the inactive features are less correlated. In the (i+ 1)-th iteration, 


to activate a new feature, the algorithm first determines a direction ya In 


i+1° 
the proof of Theorem 3.4.3, we show that by updating W+! using aya,,,, 
the equal correlation condition always holds for W+! + aya,,,, Va < 0. 
However, we cannot set the a value arbitrarily. Let f, be an arbitrary feature 


in the active set. We find the a value by solving the problem 


ae = arg min (If? (RE — aXa, Yaa ) la = (1 JE REA). 


GEAG41 
(3.38) 
The obtained a* is the minimal step size for activating a new feature. It is easy 
to verify that when a < a*, all the inactive features are still less correlated to 
the residual 


Is; (Riu pi > RT lla < (1 că E REA o. 


And when a = a*, the inactive feature corresponding to a* starts satisfying 
the equal correlation condition, which provides us a hint that this feature may 
become activet when we update W!'+1 by a*ya,,,. Given the equivalence be- 
tween the feature-residual correlation and the A, shown in the Theorem 3.4.1, 
we set the value of A;+1 using the equation 


Aia = (1 aL REA. 


We also update the weight matrix and the active set to include the newly ac- 
tivated feature in the tentative solution. Note that the updated weight matrix 
only stratifies the necessary condition for an optimal solution. So it may not 
be the optimal one. Therefore, in the second step of the algorithm, we adjust 
this weight matrix to an optimal one. Note that it is possible that the optimal 
active set may be different from the one determined in the first step. 

The second step of the algorithm can be done efficiently. First, in Line 
11, we solve an L2 -norm regularized regression problem based only on the 
features in the active set, which is of a much smaller scale compared with 
that of the whole feature set; second, we compute the optimal solution based 
on the tentative solution obtained in the first step. It turns out that the 


“IF several features correspond to a*, they all satisfy the equal correlation condition, 
when a = a*. In this case, at least one of them may become active. 
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tentative solution is usually close to the optimal one. Therefore, the solver 
often converges in just a few iterations for solving the problem specified in 
Line 11. 

Let m be the number of features, n the number of samples, C the number 
of columns of Y, and | the number of selected features. The time complexity 
of MRSF is 


o( (ImnC + ?nCM1,) ly + (1% + ImC) n) (3.39) 


where M is the maximal number of iterations specified in Algorithm 4, lz 
is the averaged number of ties for searching the proper L in the validation 
process, and ly is the number of the backtraces for adjusting A in Lines 16 
and 17 of Algorithm 5. If assuming | < CMizly, Equation (3.39) can be 
simplified to 


o(inc (m + IMlL) tv). (3.40) 
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Example 16 An empirical study of the efficiency of MRSF 


We construct an artificial data set by randomly generating a data matrix 
X e R1000x10000 (1000 samples and 10000 features) and a target matrix 
Y e R1000x 10. 

For comparison, we first apply the accelerated gradient descent method 
(AGD) on the artificial data with different A values for regularization. The 
performance of AGD is reported in Table 3.1. From the table we can observe 
that as A decreases, more and more features are selected, which leads to 
smaller residual values. On the artificial data, on average, it takes 0.22 
seconds for AGD to run one iteration.? And on average, AGD converges in 
about 500 iterations. When A = 0.5, AGD selects 54 features, which takes 
about 122 seconds. 

We then apply MRSF to select 50 features on the artificial data. MRSF 
runs 40 iterations, which takes about 32 seconds, almost 4 times faster 
than the case for AGD on the whole data set. The residual of the solution 
is 9.79. 

Figure 3.3 shows the run time of each AGD solver call in MRSF, which 
takes 0.42 + 0.14 seconds on average. Figure 3.4 presents the number of 
iterations for each AGD call to converge. On average, it takes about 30 
iterations. This suggests that the run time of each AGD iteration is about 
0.02 seconds, which is 10 times faster than the time used by the AGD solver 
on the whole data. The reason is that in this case, the AGD solver runs 
on a much smaller problem containing only the features activated in the 
tentative solution (see Line 11 of Algorithm 5). We notice that the number 
of iterations for the AGD solver to converge is also much smaller in this 
case. It suggests that the tentative solution is close to the optimal solution. 

Figure 3.5 shows the number of features selected by MRSF in each 
iteration. It verifies that MRSF may select more than one feature in each 
iteration, if more than one feature violates the global optimal condition 
(Lines 16 and 17 of Algorithm 5) in the validation step. Figure 3.6 presents 
the number of AGD calls in each iteration. It shows that on the artificial 
data used in this experiment, on average ly, the number of the backtraces 
in the validation step of MRSF, is about 1.25, which is quite small. 

This example demonstrates that MRSF is an efficient solver for select- 
ing a specific number of features by generating a solution path. 


5The experiment is run on a computer with an Intel Core 2 Duo CPU and 4GB memory. 
The operating system is Mirosoft Windows 7 64bit. MRSF is implemented using Matlab. 
The AGD solver is downloaded from http://www.public.asu.edu/ jye02/Software/SLEP/, 
which also runs in MATLAB. 

SMRSF may select more than one feature in each iteration, if more than one feature 
violates the global optimal condition (Lines 16-17 of Algorithm 5) in the validation step. 


60 


50 


20 


o | 
.20 H 
.10 E 
10 
2 3 4 


Multivariate Formulations 79 
TABLE 3.1: The performance of AGD on the artificial data. 

A 1.0 0.9 0.7 0.5 0.3 0.1 0.0 
selected features 0 5 17 54 97 196 10000 
iterations 2 362 555 561 828 1017 441 
time (sec) 0.94 79.30 122.79 122.21 175.59 223.44 96.21 
time per iteration | 0.47 0.22 0.22 0.22 0.21 0.22 0.22 
residual 50.80 15.06 10.95 9.78 9.60 9.05 0.00 
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FIGURE 3.3: MRSF, run time of each AGD solver call. 


FIGURE 3.4: MRSF, number of iterations for each AGD call to converge. 
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FIGURE 3.5: MRSF, number of features selected in each iteration. 
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FIGURE 3.6: MRSF, number of AGD calls in each iteration. 


3.5 A Formulation Based on Matrix Comparison 


In spectral feature selection, we try to select features that can preserve the 
sample similarity specified in a given similarity matrix S. In Equation (3.16), 
this is achieved by solving a sparse multi-output regression problem, in which 
the weighted eigenvectors of S are used as the target. As shown in Theo- 
rem 3.2.1, using this formulation, we can find a set of features, such that the 
linear kernel based on the linear combination of the selected features is very 
close to S. When the Frobenius norm is used to measure the closeness of two 
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matrices, this problem can be formulated as: 
: Ty T 
arg min [xwWw X — sil, 
st. A= {i:||w'||, > 0}, Card (A) =1. 


We can simplify the problem as selecting a set of features such that the 
linear kernel obtained based on the selected features is close to S: 


min ||S — X¿X Al, (3.41) 
where X4 =(E,,.. Li), G € A, ¿=1,...,k. 


Treating ||S — XX || as a feature selection criterion, we can use it with 
the traditional forward search strategy for feature selection. The algorithm 
is named MCSF, which stands for Matrix Comparison for Spectral Feature 
selection. The pseudo-code of the algorithm can be found in Algorithm 6. Note 
that the card (-) in Line 2 returns the size of a set, and Lines 3 and 7 make 
use of the fact 


z Tag £T 
S-X4X! =S D Bta 
where XA = (fa, fia), 14€ A, j=1,...,k. 


The above equation shows that once a feature f; is selected, we can remove 
fif; from S, and use the obtained residual R in the later steps. This strategy 
eliminates the redundant computations and makes the algorithm efficient. To 
select | features, the algorithm needs to run / iterations. In each iteration, it 
greedily picks one feature that is most consistent with the current residue R, 
adds the feature to A, and updates the residue by R = R — ff. 

In each iteration, features are compared to R. Note that R is obtained by 
substracting the sum of the outer products of the selected features from S. 
So, features are not evaluated independently. And the formulation promotes 
the selection of uncorrelated features. 

The condition specified in Line 4 of Algorithm 6 guarantees that ||R||} 
monotonically decreases after each step. It can be shown that the time com- 
plexity of the algorithm is O (Imn?), where | is the number of selected features. 
Assuming features have been normalized to have unit norm, ||f;|| = 1, we have 
the following equation: 


arg min;ga |R — fif; ||} = arg max;ga £; Rf;. 


Therefore the problem specified in Line 3 can be solved in O(mn?) opera- 
tions. When / features are selected, the total time complexity of the algorithm 
is O(lmn?). 
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Algorithm 6: MCSF: Matrix comparison for spectral feature selection 
Input: f,,...,fm, S, l 
Output: A - the selected features 
A=9,R=S; 
while card(A) < 1 do 
î* = arg ming, |R — fif; 1%; 
if |R- £f [2 > RI then 
| return A; 
else 
L A=AU(*);R=R-ff,; 


Noa Pp WN HF 


00 


return A; 


3.6 Feature Selection with Proposed Formulations 


The proposed multivariate formulations can be used for both supervised 
and unsupervised feature selection. The key is how to construct the similarity 
matrix S in different learning settings. To address this, the methods introduced 
in Section 2.5.1 can be used. 

In a supervised learning setting, if Equation (3.16) is used for feature 
evaluation, it is more efficiently for us to construct the target matrix Y directly 
from the label information. For example, we can define Y as 


“vi oe (3.42) 


4/52 otherwise 
or 
Sle 1 Y=) 
Yi ={ -1 otherwise ` (3.43) 


Interestingly, in Section 4.2, we will show that when these Y are used in 
Equation (3.16), we can actually obtain sparse solutions for the Least Square 
Linear Discriminant Analysis (LSLDA) [204] and the Least Square Support 
Vector Machine (LSSVM) [178], respectively. 

In the next chapter, we will study the connections among spectral feature 
selection and some representative dimension reduction algorithms. We will 
show that spectral feature selection not only unifies many supervised and 
unsupervised feature selection algorithms, but also connects feature selection 
with feature extraction via its multivariate formulations. 


No = 4.3 x 10 M3 = 1.5 x 1074 


COLOR FIGURE 1.9: The contour of the second and third eigenvectors 
of a Laplacian matrix derived from a similarity matrix S. The numbers on 
the top are the corresponding eigenvalues. 
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COLOR FIGURE 2.3: Contours of the eigenvectors £1, £2, é3, €4, €5, and 
E20 of L. 
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COLOR FIGURE 2.4: Contours of the eigenvectors £1, £2, é3, €4, €5, and 
E20 of L. 
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COLOR FIGURE 2.6: The cut value 


under different cluster sizes (x-axis). The x-axis corresponds to the value of 


n in Figure 2.5. 
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COLOR FIGURE 2.7: Contours and the scores of six features. Among 
these features, Fı and F are relevant, and F3, F4, F5, and Fẹ are irrelevant. 
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COLOR FIGURE 2.13: Effects of noise on the feature ranking functions. 
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COLOR FIGURE 4.4: Study of supervised cases: Plots for accuracy 
(y-axis) vs. different numbers of selected features (x-axis) on the six data 


sets. The higher the accuracy, the better. 


COLOR FIGURE 6.7: Cluster analysis on the genes selected by 
KOFSpyop (left) and GO-REL-PROP (right), respectively. The color lines on 
the bottom of the figure correspond to the samples from patients of B-cell 
ALL (blue), T-cell ALL (red), and B-cell ALL with the MLL/AF4 
chromosomal rearrangement (green), respectively. 


Chapter 4 


Connections to Existing Algorithms 


Spectral feature selection is a general framework. In this section we show that 
a number of existing feature selection algorithms are essentially special cases of 
spectral feature selection. These algorithms include Relief and ReliefF [158], 
Laplacian Score [74], Fisher Score [45], HSIC [165], and Trace Ratio [130]. 
These algorithms are designed to achieve different goals. For instance, Fisher 
Score and ReliefF are designed to optimize sample separability, Laplacian 
Score is designed to retain sample locality, and HSIC is designed to maxi- 
mize feature class dependency. We can show that these algorithms actually 
select feature by evaluating features’ capability to preserve sample similarity 
in similar ways. 

In this chapter, we also study the connections between multivariate spec- 
tral feature selection and a number of well-known learning models, includ- 
ing principal component analysis (PCA) [85], linear discriminant analysis 
(LDA) [199], and least square support vector machine (LS-SVM) [178]. The 
study provides us an interesting insight into these existing models, and allows 
us to utilize the efficient solvers developed in Chapter 3 to generate sparse 
solutions for the models. We notice that spectral feature selection is for se- 
lecting original features, while PCA and LDA are for extracting new features 
from the original ones. So the multivariate formulations for spectral feature 
selection form a bridge connecting the two different types of dimensionality 
reduction techniques. 


4.1 Connections to Existing Feature Selection 
Algorithms 
We first show how a number of existing feature selection algorithms can 
be unified with our univariate formulations for spectral feature selection. To 


achieve this, we prove that all these algorithms can be reformulated in a 
common form: 


= PTF 
max dep. (F) = max f Sf, (4.1) 


sub 
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Here Es. is the set of selected features, and f and S are the normalized 
feature vector and the normalized sample similarity matrix, respectively. The 
only difference among these feature selection algorithms is that they use differ- 
ent ways to compute f and S. As we have analyzed in Section 3.1, if a feature 
selection criterion is in the form of Equation (4.1), it will select features by 
evaluating features” capability of preserving the sample similarity specified by 
S, and can be treated as a special case of spectral feature selection. 


4.1.1 Laplacian Score 


Laplacian Score [74] is an unsupervised feature weighting algorithm that 
uses a filter model. Given an adjacency matrix S, let D and L be its corre- 
sponding degree and Laplacian matrices, respectively. The Laplacian Score of 
f can be calculated via the equation 

f'Lf i f'D1 
f) = == h f=f-> 
PL | mp bere Di 

We show that y2(f) = yz (f) in the theorem below. Here pa (-) is the 

feature ranking function defined in Equation (2.17) in Chapter 2. 


1. (4.2) 


Theorem 4.1.1 The Laplacian Score [74], an unsupervised feature selection 
algorithm, is a special case of SPEC, when p(-) = pa(-). 


Proof: The feature evaluation criterion of the Laplacian Score is 


Plugging f in pr (F), we have 


(F) = fTLf _ (D2f)' £(D=t) 
MNI pp Ea = A 
17D1 (D2f)'(D2f) 


((D2#)"(D21))” 


(D21)7(D?1) 


If we let €1 be the first eigenvector of the normalized Laplacian matrix £, we 


have €1 = 2, Also in pa (+), f= 2E. Therefore, the following equation 
holds: Asia 
f Lf 
or (PF) =— Da Pa) 
1- (fra) 


Theorem 4.1.1 shows that the Laplacian Score is a special case of SPEC, 
and based on this theorem, we have the following theorem. 
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Theorem 4.1.2 Let $ be the similarity matrix, selecting | features using the 
Laplacian Score can be achieved by maximizing the objective function 


arg MAXp,, „Fi E Sî,, 
j=1 
where f and $ are defined as 
1 _ ET . 
= a Peta se > S=p-isp-i 
D (e) 


4.1.2 Fisher Score 


Fisher Score [45] is a supervised feature weighting algorithm with a filter 
model. Given the class label y = fy1,...,Yn), Fisher Score prefers features 
that assign similar values to the samples from the same class and different 
values to the samples from different classes. The evaluation criterion used in 
the Fisher Score can be formulated as 


Cc 2 
jai 25 (Hy — n) 


E 728 ’ 
jai njo; 


where u is the mean of the feature f;, n; is the number of samples in the 
jth class, and uj and o, are the mean and the variance of f; on the class j, 
respectively. 

As shown in [74], when the similarity matrix S is derived from the class 
label using the equation 


pr (Fi) = (4.3) 


1 
FIS _ nr? Yi=Yj= l 
Sat { 0, otherwise ` (4.4) 


Laplacian Score and Fisher Score are equivalent in the sense that 


1 


a (4.5) 


pr (Fi) 


Therefore, we have the following theorem. 


Theorem 4.1.3 Let S be the similarity matrix defined in Equation (4.4). To 
select l features using Fisher Score can be achieved by maximizing the following 
objective function: 


j=1 
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Here f and § are defined as: 
~ 3 A f — FT A 
E 2 | oe S§=D-?SD-?. 
|D=£| ia (Fr) 


4.1.3 Relief and ReliefF 


Relief [90] and its multiclass extension, ReliefF [94], are supervised fea- 
ture weighting algorithms using the filter model. Assuming M instances are 
randomly sampled from the data, the feature evaluation criterion of Relief is 
defined as 


M 
1 
) = 5 do (les — NM codil — Izei — NEG). 
t=1 


In the equation, x+; denotes the value of instance x; on feature f;. NH(x) 
and NM(x) denote the nearest points to x in the data with the same and 
different labels, respectively, and || - || is a distance measurement. To handle 
multiclass problems, the above evaluation metric is extended in ReliefF to the 
equation 


er (A) = Lo Y les al 


Mi OL(xs) x¡ENH(x) 


P(C) 1 
+ x x [Eti Baal] 
co) (I= PCHRD * Mes ™ Peace 
(4.6) 

Here, CL(x¿) returns the class label of the instance x+, and P(C) is the 
probability of instances belonging to the class C. x+; is the value of the feature 
f; on the instance x+. NH(x) denotes the set of samples that are nearest to x 
and with the same class of x. A sample in N H(x) is called a “nearest hit” of x. 
NM (x, C) denotes the set of samples that are nearest to x and with the class 
label C (C # CL (x¿)). And a sample in NM (x) is called a “nearest miss” of 
x. Mt cL(x) is the size of NH(x), and M; a is the size of NM (x, C). Usually, 
the sizes of both NH(x) and NM(x,C) are set to a prespecified constant. 

The relevance evaluation criteria of Relief and ReliefF show that the two 
algorithms seek features that contribute to the separation of samples from 
different classes. 

Assume that the training data have c classes with p instances in each 
class; there are h instances in both NH(x) and NM(x,C); and all features 
have been normalized to have the unit norm. As shown in [222], under the 
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specified assumptions, the feature relevance evaluation criterion of ReliefF can 
be formulated as 


Z 1 ento AN 
YE SY e (47) 


i=1 |jeNH(x;) Czyi 


In the above equation, f; is the value of the feature f on the ¿-th instance, 
x;. Here we use the Euclidean distance to calculate the difference between 
two values, and use all training data to train ReliefF. To study the connection 
between ReliefF and spectral feature selection, we define a similarity matrix 
S as 


de a 
SFEL = gi: zj € NH (xi) i (4.8) 
DR *Yj€ NM (x;, CL(x;)) 


To ensure that SPFL is symmetric, we assume that if 2; € NH (x;), we 
also have x; € NH (xj); and if z; € NM (x;, CL(x;)), we also have x; € 
NM (xj, CL(x;)). By applying Theorem 2.2.1, it is easy to verify that D = I, 
and yr (F;) = —1+f£!SRELE is equivalent to the evaluation criterion defined 
in Equation (4.7). Since pi (F;) = —1 + f!SRELE, where (1 (-) is the feature 
evaluation criterion defined in Equation (2.12), we can see that under these 
assumptions, ReliefF also forms a special case of SPEC. Based on the above 
observation, we have the following theorem. 


Theorem 4.1.4 Let S be the similarity matrix defined in Equation (4.8). 
Selecting | features using ReliefF can be achieved by maximizing the objective 
function 


j=l 
Here, f and $ are defined as 
z Dif . 
PSDs SD, (4.9) 
|D=£]] 


4.1.4 Trace Ratio Criterion 


The Trace Ratio Criterion for feature selection is proposed in [130]. Tt 
defines two adjacency matrices, S,, and Ss. S,, represents the within-class or 
local adjacency relationship of instances, whereas S, represents the between- 
class or the global counterpart. Two graphs, Gwu and G», can be constructed, 
and their corresponding graph Laplacian matrices are L,, and Lp, respectively. 
Assuming we want to select k features, W = [w;, , w;,,*** ,Wi,] € R”** is the 
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selection matrix, where the column vector w;, has one and only one “1” at its 
¿j-th element, and (î1,î2,::- ,ix} € {1,2,---,n}. The Trace Ratio Criterion 
tries to find the best selection matrix W by maximizing the following objective 
function 
a aaa trace(W'X'L,XW) 

w  trace(W'X'L, XW) 
As shown in [130], the optimal solution of the problem can be obtained by 
iteratively solving the following two subproblems. First, when A; is mixed, we 
solve problem (P1): 


(P1): Wii = arg max trace (W! X! (Ly — \iLw) XW). (4.11) 


(4.10) 


Second, when W, is fixed, we solve problem (P2): 


trace(W/,¡X'L¿XW;,1) 


trace(W¡, XL, XW;+1) i 


Since 


trace (WX! (Ly — ALu)XW) = >` f£; (Le — AL) fi, 


iEfi1,i2, 0: tn} 
it is easy to verify that when A is fixed, the subproblem (P1) can be solved 
by picking the top k features with large f; (Ly — AL.,) f; values. Therefore, 
although the Trace Ratio Criterion is proposed for subset feature selection, 
features are actually evaluated independently in the feature selection process. 
We have the following theorem to build a connection between the Trace Ratio 
Criterion and the SPEC framework. 


Theorem 4.1.5 Assume A* is optimal for Equation (4.10). Selecting l fea- 
tures using the Trace Ratio Criterion can be achieved by maximizing the fol- 
lowing objective function: 


l 
Nees AA 
arg MaXF,, ,...,Fi, > E, S fi, 


j=1 
Here, f and $ are defined as 
f =f, S=(Lp-A*Ly). (4.13) 
a 


The theorem suggests that to maximize f ' (Ly — AL.,) f, a feature needs to 
simultaneously maximize f'L, f, which requires assigning different values to 
samples that are from different classes; and minimize f'L,, f, which requires 
assigning similar values to samples that are from the same class.! The Trace 
Ratio Criterion selects features in a similar way as the Fisher score. Actually, 
it is shown in [130] that with specific definitions for Lẹ and Ly, The Trace 
Ratio Criterion is equivalent to the Fisher Score method. 


1) is used to balance the two components in the criterion. 
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4.1.5 Hilbert-Schmidt Independence Criterion (HSIC) 


The Hilbert-Schmidt Independence Criterion (HSIC) was first proposed 
in [62] for measuring the dependence between two kernels. In [165], HSIC 
is applied for feature selection, and the basic idea is to select a subset of 
features, such that the kernel constructed using the feature subset maximizes 
HSIC when compared to a given kernel K. In [165], an unbiased estimator of 
HSIC is given as: 


1 1'Kp11'K1 2 
== __ | Trace (KK 
ara a) a ads roo 


pu (F) = 1'KpK1| . (4.14) 
In the equation, F is a subset of the original features and Kp is the kernel 
obtained from F. To achieve an unbiased estimation, HSIC requires the diag- 
onal elements of K and Krp to be set to 0. Based on HSIC, features can be 
selected via either backward elimination or forward selection. Using a general 
kernel in HSIC can be very time-consuming, due to the complexity of the 
kernel construction step in each iteration. Therefore, a linear kernel is usually 
used. It is shown in [164] that when a linear kernel is used for constructing 
Kg, selecting k features using HSIC can be achieved as solving the problem 


l 
a 
arg maxr, „Pa y El Susie fi, 
j=1 


where 


17K1 2 
n—1)(n—2)n— 


Susrc = T 3) K + (117 Dz 


5 (K117 — diag ax1))| 
(4.15) 


It is clear that in this case, HSIC forms a special case of SPEC, which is 
formally stated in the following theorem. 


Theorem 4.1.6 When a linear kernel is applied, selecting | features using 
HSIC can be achieved by maximizing the following objective function: 


l 
at é A A 
arg MaX F; ,...,Fi, > f, S ijs f =f; S = Susrc. 
j=1 


4.1.6 A Summary of the Equivalence Relationships 


We show that five existing representative feature selection algorithms, in- 
cluding Laplacian Score, Fisher Score, ReliefF, Trace Ratio, and HSIC, all fit 
into the framework formulated in Equation (3.1). In Table 4.1, we summarize 
the sample similarity matrix and the corresponding normalization criteria used 
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in these algorithms. It turns out that although these algorithms were originally 
designed to achieve different goals, they actually select features via estimating 
their capability toward preserving sample similarity. One limitation seen in all 
these algorithms is that they evaluate features independently, causing them to 
be unable to handle redundant features. This is a common drawback of these 
algorithms. To address this drawback, the multivariate formulations presented 
in Chapter 3 for spectral feature selection can be utilized. 


4.2 Connections to Other Learning Models 


In the last section, we showed that many existing feature selection algo- 
rithms can be reformulated as special cases of the univariate formulations 
for spectral feature selection. In this section, we show that the multivariate 
formulation for spectral feature selection, 


; 2 
arg min [Y — XW e +A |W]: 
st. A={i: wėl > 0), Card (A) = l, (4.16) 


can also be connected to some well-known learning models, including the 
principal component analysis (PCA) [85], the linear discriminant analysis 
(LDA) [51], and the support vector machine (SVM) [187], through their least 
square formulations [178, 176]. These connections provide us interesting in- 
sights, and allow us to compute sparse (or sparser) solutions for these models 
using the techniques developed in Chapter 3. Note that spectral feature selec- 
tion is for selecting original features, while PCA and LDA are for extracting 
new features from the original ones. Therefore, the multivariate formulations 
for spectral feature selection form a bridge that effectively connects two dif- 
ferent types of dimensionality reduction techniques and allows for their joint 
study. 


4.2.1 Linear Discriminant Analysis 


Linear Discriminant Analysis (LDA) [51, 200, 204, 207, 202, 203, 201, 196] 
is a supervised feature extraction approach. Assuming samples are from k 
different classes. LDA tries to generate a k — 1 dimensional space to repre- 
sent the data, such that in this space the samples from different classes are 
well separable. LDA extracts new features by linearly combining the original 
features. 
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Example 17 Linear Discriminant Analysis (LDA) 


Figure 4.1 shows how linear discriminant analysis works. In this ex- 
ample, the data X € R”*? distribute in a two-dimensional space (x-y 
axes), and the samples are from two different classes. LDA determines a 
weight vector w € R?*!, and uses w to project the samples to generate 
one-dimensional data 


X=Xw, XER”. 


Let SEP (w) be a measurement that evaluates the separability of the 


one-dimensional data X, e.g., the measurement defined in Equation (4.20). 
LDA tries to find the w that maximizes SEP (w). Note that, in this ex- 


ample, X contains only one feature, and it is a linear combination of the 
original features in X. 


Below, we present the sample separability measurement used in LDA. To 
this end, we first define the total covariance matrix and the between-class 
covariance matrix. We then define the sample separability measurement based 
on the two covariance matrices. 

Let n be the number of samples, n; the number of samples in the j-th 
class, x; the i-th sample, c the mean of the data, and c; the mean of the 
data in the j-th class.? The total covariance matrix S; and the between-class 
covariance matrix S, are defined as 


S = -Y (x-0(x-0)!, 
n 4 
i=1 
k 
1 | | T 
Sp = —) nj (e — c) (co — c) 
n 4 
j=1 
where c = A ci) = a 5 Xi. 
N £ : Ni 
i=1 ieclassj 


Let X = XW be the data obtained by projecting the original data X 
into the lower-dimensional space generated by LDA, where W € Rm*(c-1) 
is the projection matrix.* Let S, and $, be the total and the between-class 
covariance matrices of the data in the lower dimensional space, respectively. 


2x5, C, cj € R™*! are column vectors. 
3m is the number of features, c is the number of classes. Note that W has c— 1 columns, 
since rank (S) =c— 1. 
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FIGURE 4.1: Linear discriminant analysis. 
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We have 


S, = W'S,W, (4.17) 
S, = W'S,W. (4.18) 


Based on the the total and the between-class covariance matrices, a popular 
criterion for measuring sample separability in LDA [199] is defined as 


Trace (S; *S») . (4.19) 


It can be verified that for a data set, when samples from the same class 
are near one another, and samples from different classes are far apart, the 
above equation will return a large value [201]. Based on this observation, the 
projection matrix W in LDA is obtained by solving the problem 


W* = arg max (Trace (S1S,)). (4.20) 


It can be shown that the c—1 columns of W* are given by the c—1 eigenvectors 
of S; *S,, corresponding to the c— 1 largest eigenvalues [199]. 

In Least Square LDA (LSLDA) [204], instead of solving an eigenvalue 
problem, we compute W by solving a least square problem. Let y; be the 
label of the i-th sample. We can define a target matrix, Y as 


yipa_) Vu Vn a (4.21) 
a+ otherwise. 


Assuming that X is centralized, we can verify that the following equation 


holds: 1 
Se = RUAY PAI, S: = XX". (4.22) 


And the objective of LSLDA is defined as 


min | XW — YA Wee, (4.23) 


In [204], it is shown that when rank(S;) = rank(S,,) + rank(S;), LSLDA 
is strictly equivalent to LDA. It turns out that this condition is quite mild, 
and it usually holds when the dimensionality of the data is high. 

Comparing Equation (4.16) with Equation (4.23), we can see that the 
multivariate formulation for spectral feature selection and LSLDA share sim- 
ilar objective functions. Their sole difference is that the former one applies 
a sparse regularization to ensure that only l features are used to construct 
the optimal solution. Therefore, when YLP4 is used as the target matrix in 
Equation (4.16), we can obtain a sparse solution for LSLDA, and the lin- 
ear combination of the selected feature will form a lower dimensional space, 
in which samples from different classes can be separated well. This analysis 
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shows that the sparse LSLDA (SLSLDA) forms a special case of the multi- 
variate spectral feature selection by using Y4P4 as the target matrix. Also 
we can show that when S is the matrix defined in Equation (4.4), KzsLDaA 
can be formulated as 


KusLDA = pa) (nat sz nst1S = 117. (4.24) 


Since the matrix 11! and n are constant, it turns out that the Fisher Score and 
the LSLDA essentially specify the same simple similarity. Unlike the Fisher 
Score, the LSLDA is a supervised feature extraction algorithm, which gener- 
ates new features by combining original features. 


4.2.2 Least Square Support Vector Machine 


Support Vector Machine (SVM) [34, 36, 167, 150, 73, 229, 27, 22, 34] is a 
supervised learning model that has been widely applied in various data mining 
applications [216, 83, 68, 103, 184, 55, 20, 84]. 


@ Support vectors of the positive class 


© Support vectors of the negative class 
FIGURE 4.2: A linear separating hyperplane of support vector machine. 


As shown in Figure 4.2, given a binary classification problem, SVM tries to 
find a hyperplane, which can separate the samples from different classes with 
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a large margin. This idea can be formulated using the following equation. 


min |||] 
s.t. Yi (x; w + b) >1-&, Vi 


€ >0, E <C. (4.25) 


i=1 


In the above equation, x; is the i-th sample, and y; € {—1, 1} is its label. 
w and b are the weight vector and the interception defining the hyperplane 
h= {x|x"w +b= 0). Given a sample x, x! w+b computes the distance from 
x to the hyperplane h. It is easy to verify that if x is below the hyperplane, 
x'w +b <0, otherwise x'w +b > 0. 

Let the region below h correspond to the negative class, and the other side 
corresponds to the positive class. The condition, yi (xi w + b) > 1, forces all 
the samples to stay in the region that their classes belong to. And additionally, 
they must be at least fel away from the hyperplane h [72]. Therefore, ical 
measures the size of the margin corresponding to classifying the samples using 
the hyperplane h. 

Since the samples are not always linearly separable, slack variables €;, i = 
1,...,n are introduced to allow some samples to stay on the wrong side: 


yi (x wtb) >1-&, Vi. 


Obviously, if y; (x w + b) < 0, x; will be on the wrong side, and corre- 
spondingly, £; > 0. Similarly, if y; (x w + b) > 1, x; will be on the correct 
side, and correspondingly, €; = 0. This observation suggests that » €, can be 
used to bound the total number of training misclassifications. In SVM, all the 
samples that satisfy y; (x w + b) < l are called support vectors, and they are 
either in the margin area, or on the wrong side of the hyperplane h. 

The training of SVM involves solving a convex quadratic optimization 
problem (QP). Solving QP is computationally expensive, due to the need 
to compute a dense Hessian matrix. Solving the QP with a general-purpose 
QP solver would have a time complexity of O(n*) [18], which is not scalable. 
To address this problem, powerful optimization techniques such as the interior 
point method [194] and the coordinate descent method [29] have been applied 
for solving the problem in very efficient ways. 

Instead of solving a QP problem, in the least square SVM (LSSVM) [178], 
the SVM formulation is approximated and reformulated as a least square 
problem: 

min IXW — vs 2. W e RAE, 


1 7 
SVM _ Vi =] 
Yi ={ -1 otherwise. (4.26) 


In [208], it is shown that for high-dimensional data, LSSVM and SVM behave 
similarly. 
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Comparing Equation (4.16) with Equation (4.26), we can see that when 
Y? YM is used in both equations, the only difference between the multi-variate 
formulation for spectral feature selection and the LSSVM is that the former 
applies a sparse regularization to ensure that only l features are used to con- 
struct the solution. Therefore, when YSY is used as the target matrix in 
Equation (4.16), we can obtain a sparse solution for LSSVM. This analysis 
shows that sparse LSSVM (SLSSVM) forms a special case of multivariate 
spectral feature selection by using YSYM as the target matrix. 


4.2.3 Principal Component Analysis 


Principal component analysis (PCA) [85] is an unsupervised feature ex- 
traction approach. Given a data set X, the PCA generates a set of principal 
components (the new features), which maximizes the variance of the data pro- 


n 
jected on them. Given the covariance matrix C = 197 (x; — c)(x; c)”, the 
i=1 


principal components of X can be obtained by computing the top eigenvectors 
of C corresponding to the largest eigenvalues. 


Example 18 Principal components of two-dimensional data 


Figure 4.3 shows the distribution of a two-dimensional data set X. Since 
X e R”*?, its covariance matrix C is of rank 2, C € R?*?. Therefore, 
using PCA we can obtain two principal components for X: w and w2. 
w:1 is the largest principal component. It is the direction that maximizes 
the variances of the projected data, Xw ,. wa is the smallest principal 
component. w and wa are orthogonal. 

If we let P = I — 1117, we have C = (PX) (PX). Based on this 
observation, we can compute the variance of Xw; using the equation 


Var (Xw,) = (PXw,)' (PXw,) = w] Cw. 


To maximize the above equation, we have w = €, where €, is the 
eigenvector of C corresponding to the largest eigenvalue. Similarly, let X 
be an m-dimensional data, X € R"*"”. Its first l principal components can 
be obtained by computing the top l eigenvectors of C, which corresponds 
to the l largest eigenvalues. 


Below we show the connection between the multivariate formulation for 
spectral feature selection and the least square formulation for PCA. 

Assume that m > n, and X= UEV! is the SVD of X. Let U = 
{ui,...,Un}, V = [v1,..., Vn}, and 2 = diag{A1,..., An}. It is shown in [230] 
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FIGURE 4.3: The principal components determined by PCA. 


that the principal components generated by PCA can be equally obtained by 
solving the following problem: 


w; = arg max lly; — Xwille 


Based on this observation, the authors in [230] proposed to obtain the 
sparse solutions for PCA by solving an L-norm regularization problem: 


w; = arg max lly; — Xwille 
[will < t, 


By considering the k principal components together, Equation (4.28) can be 
written as: 


W = arg max |Y — XxX WI y, 
W=(w1,...,Wx), wil < t, 
Y =Vire, (4.29) 


where Vk = [v1,..., Vk}, and Xa = diag[A1,..., Ag). 
Comparing Equation (4.27) and Equation (4.29) with Equation (4.16), it 
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is clear that PCA eccentrically projects samples to a lower-dimensional space, 
where the similarity between samples measured by their inner product (in 
the original space) is best preserved. Therefore, the sparse PCA formulation, 
specified in Equation (4.28), can be regarded as a special case of the multi- 
variate spectral feature selection, when a linear kernel is used to define the 
similarity among samples. As compared to sparse PCA, the usage of the La 1 
norm in Equation (4.16) makes the formulation more suitable for feature se- 
lection. The £¡-norm used in Equation (4.28) ensures that each w; is sparse. 
However, in different w;, different features are used. When considering all w; 
together, many features may still be used in the obtained W. In contrast, the 
Lo norm ensures that only a small set of the features are used in constructing 
the whole W. 


4.2.4 Simultaneous Feature Selection and Extraction 


Feature selection achieves dimensionality reduction by selecting a small set 
of the original features. Feature extraction reduces dimensionality by generat- 
ing a small set of new features via combining the original features. Currently, 
feature selection and feature extraction are largely studied independently. The 
above analysis shows that feature selection and feature extraction can actually 
be done simultaneously with the multivariate formulation for spectral feature 
selection. 

By applying feature selection, the redundant and irrelevant features can 
be removed. This helps us effectively reduce the noise in the data. By applying 
feature extraction, we can combine the selected features and generate a smaller 
set of new features. Compared to the original features, the newly generated 
features may possess stronger discriminative power, and usually result in bet- 
ter learning performance. The multivariate formulations for spectral feature 
selection form a bridge connecting feature selection with feature extraction, 
and allow us to take full advantage of both techniques. 


4.3 An Experimental Study of the Algorithms 


In this section we empirically evaluate the performance of various spectral 
feature selection algorithms in both supervised and unsupervised learning con- 
texts. In the experiments, we compare nine feature selection algorithms. For 
supervised learning, eight feature selection algorithms are chosen for com- 
parison: ReliefF, Fisher Score, Trace Ratio Criterion, HSIC, MRSF, MCSF, 
mRMR [40], and AROM-SVM [192]. The first six are spectral feature selec- 
tion algorithms. The last two are existing state-of-the-art feature selection 
algorithms. Both of the algorithms are able to handle redundant features, and 
are used in the experiment as the baseline algorithms for comparison. For un- 
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TABLE 4.2: Summary of the benchmark data sets 


Data Set | 4 Features | # Instances | # Classes 
PIE10P 2400 210 10 
ORL10P 10000 100 10 
TOX 5748 171 4 
CLL-SUB 11340 111 3 


supervised learning, six algorithms are used for comparison: Laplacian Score, 
SPEC, Trace Ratio Criterion, HSIC, MRSE, and MCSF. They are all spec- 
tral feature selection algorithms. For MRSF and MCSF, in supervised learning 
context, S is calculated by Equation (4.4); and in unsupervised learning, S 
is calculated by the Gaussian RBF kernel function. Four high-dimensional 
data sets are used in the experiment. They are two image data sets: PIE10P* 
and ORL10P?; and two microarray data sets: TOX and CLL-SUB.* Detailed 
information of the benchmark data sets is listed in Table 4.2. 

Let F be a set of the selected features, and Xp is the data set only con- 
taining the features in F. In the supervised learning setting, algorithms are 
compared on (1) classification accuracy and (2) redundancy rate. The 
redundancy rate is measured by 


1 
RED(F)===— Y pig, 


TOMA) yoga E 


where p; į measures the correlation between the i-th and j-th features. A large 
value of RED (F) indicates that many selected features are strongly correlated 
and thus redundancy is expected to exist in F. 

In the unsupervised learning setting, three measurements are used to com- 
pare the performance of the feature selection algorithms: (1) redundancy 
rate defined in Equation (4.3); (2) scale of the residue; and (3) Jaccard 
score [81]. The scale of the residue is computed by the equation 


[XpXp "KI. 
The Jaccard score is computed by the equation 
14 NB(i,k, Kp) NB (i,k,K) 
JAC (Kp, K, k) = 
sea) DI NB (i,k, Ke) U NB (i,k, K)’ 


i=1 


Za 


where Kp = XpXp and K are the linear kernel constructed using the selected 


“http: //peipa.essex.ac.uk/ipa/pix/faces/manchester/. Images are subsampled down to 
the size of 6040. 

Shttp://www.uk.research.att.com/facedatabase.html. Images are subsampled down to 
size of 100x100. 

SBoth data are retrieved from the Gene Expression Omnibus gene expression repository 
(http: //www.ncbi.nlm.nih.gov/geo/) with retrieval IDs: GDS1454 and GDS968. 
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features and the input similarity matrix, respectively. Here NB (i, k, K) re- 
turns the k nearest neighbors of the i-th instance according to the pairwised 
similarity specified by K. The Jaccard score measures the averaged overlap- 
ping of the neighborhoods specified by Kp and K. A high Jaccard score 
indicates that the pairwised similarities specified by the two similarity ma- 
trices are consistent. The scale of the residue and the Jaccard score are used 
to assess an algorithm's capability of preserving the sample similarity in the 
continuous and the discrete ways, respectively. 

In the supervised learning context, for each data set, we randomly sample 
50% data for training, and the remaining for testing. The process is repeated 
20 times. Different algorithms are evaluated. The results are averaged to obtain 
the final results. Linear SVM is used for classification. The parameters for 
feature selection algorithms and the SVM are tuned with cross-validation, if 
necessary. The Student's t-test is used to evaluate the statistical significance 
with p-value < 0.05. For the unsupervised learning setting, all the samples are 
used for feature selection. The selected features are evaluated using the three 
measurements mentioned above. 


4.3.1 A Study of the Supervised Case 
4.3.1.1 Accuracy 


The classification accuracy results are shown in Figure 4.4 and Table 6.5. 
Figure 4.4 contains the plots of the accuracy of the SVM classifier using the 
top 10, 20, ..., 200 features selected by each algorithm. Table 6.5 shows the 
aggregated accuracy of different algorithms on each data set. The aggregated 
accuracy is obtained by averaging the averaged accuracy achieved by SVM 
using the top 10, 20, ..., 200 features selected by each algorithm. The value in 
the parentheses is the p-Value. In Figure 4.4 and Table 6.5, we can observe that 
MRSF produces superior classification performance compared to the other 
algorithms. The performance of MCSF is also good, which is the second best 
in the test. 


4.3.1.2 Redundancy Rate 


Table 4.4 presents the averaged redundancy rates of the top n features 
selected by different algorithms, where n is the number of samples. We choose 
n, as when the number of selected features is larger than n, any feature can 
be expressed by a linear combination of the remaining ones, which introduces 
unnecessary redundancy in evaluation. In the table, the boldfaced values are 
the lowest redundancy rates or the ones without significant difference to the 
lowest. The results from the redundancy rates show that MRCF and MRSF 
all attain low redundancy rates, which suggests that the redundancy removal 
mechanisms in both algorithms are effective. We also observe that the two 
baseline algorithms mRMR and AROM-SVM also produce low redundancy 
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for accuracy (y-axis) vs. different numbers of selected features (z-axis) on the 
six data sets. The higher the accuracy, the better. 
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TABLE 4.5: Study of the unsupervised cases: Averaged residue scale, the 
lower the better. 


Algorithm ORL PIE TOX | CLL-SUB | AVE | WIN 
Laplacian Score | 112.80 | 208.97 | 178.69 127.67 157.03 0 
SPEC-1 112.79 | 220.42 | 179.21 126.65 159.77 0 
SPEC-3 122.51 | 201.25 | 176.15 120.74 155.16 0 
Trace-Ratio 112.79 | 200.68 | 179.30 119.70 153.12 0 
HSIC 112.79 | 200.64 | 179.24 119.79 153.12 0 
MCSF 91.03 | 183.04 | 166.23 107.25 | 136.89 4 
MRSF 91.19 187.05 | 167.07 108.96 138.57 0 


rates. Since the two algorithms are able to remove redundant features, the 
observation is consistent with our expectation. 


4.3.2 A Study of the Unsupervised Case 
4.3.2.1 Residue Scale and Jaccard Score 


Tables 4.5 and 4.6 present the averaged residue scale and Jaccard score 
achieved by different algorithms on the benchmark data sets, when the top n 
features are selected. The two measurements assess algorithms’ capability of 
similarity preserving. The results show that among all the algorithms, MCSF 
and MRSF achieve better performance on all four data sets, indicating their 
strong capability of sample similarity preserving. 

We observe that the performance of MRSF is inferior to that of MCSF. 
The reason is that MRSF optimizes [XWW'X' — K||%,, while in residue 
scale and Jaccard score, XpXp is used to compute Kg. It is possible that 
the performance of MRSF is underestimated by the two measurements, since 
MRSF may select features whose linear combination can produce a similarity 
matrix that well preserves the similarity specified by K. To clearly show this, 
Table 4.7 lists the aggregated residue scale and Jaccard score of MRSF and 
MCSF, when XWW'X' is used to compute Kp in the two measurements. 
Here the aggregated residue scale and Jaccard score are obtained by averaging 
the averaged residue scale and Jaccard score over four benchmark data sets. 
The results show that when the weight matrix W is taken into account, MRSF 
achieves better results than MCSF. This indicates that the features selected by 
MRSF also have strong capability of similarity preserving through their linear 
combinations. This fact is also verified by the good classification performance 
of the SVM classifier on the features selected by MRSF. 


Connections to Existing Algorithms 105 


TABLE 4.6: Study of unsupervised cases: Averaged Jaccard score, the higher 
the better. 


Algorithm ORL PIE TOX CLL-SUB | AVE | WIN 
kNB = 
Laplacian Score 0.04 0.02 0.08 0.03 0.04 0 
SPEC-1 0.04 0.03 0.09 0.02 0.05 0 
SPEC-3 0.02 0.06 0.09 0.05 0.05 0 
Trace-Ratio 0.06 0.04 0.13 0.06 0.07 0 
HSIC 0.06 0.04 0.13 0.03 0.06 0 
MCSF 0.53 0.26 0.50 0.19 0.37 2 
MRSF 0.60 0.35 0.35 0.19 0.37 3 
kNB = 
Laplacian Score 0.09 0.05 0.14 0.09 0.09 0 
SPEC-1 0.09 0.05 0.14 0.11 0.10 0 
SPEC-3 0.13 0.07 0.16 0.12 0.12 0 
Trace-Ratio 0.08 0.07 0.15 0.12 0.11 0 
HSIC 0.08 0.07 0.15 0.13 0.11 0 
MCSF 0.72 0.27 0.47 0.23 0.42 3 
MRSF 0.70 0.44 0.35 0.22 0.43 1 


TABLE 4.7: The aggregated residue scale and Jaccard score of MCSF and 
MRSF when the weight matrix W of MRSF is considered in calculating the 
measurements. 


Algorithm | Residue | JACkyg=1 | JACkyg=5 
MCSF 136.89 0.37 0.42 
MRSFw 91.45 0.63 0.53 


4.3.2.2 Redundancy Rate 


Table 4.8 shows the averaged redundancy rates with the top n features 
selected by different algorithms on the benchmark data sets. The results show 
that the features selected by the MCSF and MRSF contain much less re- 
dundancy comparing with the other algorithms. This is expected, since those 
algorithms cannot remove redundant features. 
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TABLE 4.8: Study of unsupervised cases: Averaged redundancy rate, the 
lower the better. 


Algorithm ORL PIE TOX | CLL-SUB | AVE | WIN 
Laplacian Score 0.86 0.75 0.46 0.65 0.68 0 
SPEC-1 0.86 0.83 0.47 0.64 0.70 0 
SPEC-3 0.90 0.68 0.41 0.51 0.63 0 
Trace-Ratio 0.86 0.70 0.48 0.58 0.65 0 
HSIC 0.86 0.70 0.48 0.58 0.65 0 
MCSF 0.28 0.38 0.16 0.15 0.24 3 
MRSF 0.28 0.26 0.22 0.29 0.26 2 


4.4 Discussions 


In this chapter, we studied the connections between spectral feature se- 
lection and a variety of exiting feature selection and feature extraction al- 
gorithms. The study provides us interesting insights into these existing al- 
gorithms, and enables us to develop more powerful dimensionality reduction 
approaches based on the presented spectral feature selection frameworks. We 
conducted experiments to study the performance of various spectral feature 
selection algorithms. The experiment results from both supervised and un- 
supervised learning settings showed that the multivariate formulations for 
spectral feature selection can select features containing less redundancy and 
producing superior learning performance. 

Given a high-dimensional space, many approaches have been proposed to 
find a low-dimensional space, where the geometric structure of the data is 
preserved according to certain criteria. These methods in general fall into 
the category of dimensionality reduction via feature extraction, instead of 
feature selection. The representative algorithms in this category include Mul- 
tidimensional Scaling (MDS) [35], ISOMAP [182], Locally Linear Embedding 
(LLE) [145], Laplacian Eigenmaps [13], Semidefinite Embedding (SDE) [191], 
Neighborhood Preserving Embedding (NPE) [75], and Structure Preserving 
Embedding (SPE) [155], to name a few. The difference between feature extrac- 
tion and feature selection is that to reduce dimensionality, feature extraction 
generates a small set of new features by combining the original features, while 
feature selection selects a small set of original features exactly. By keeping 
the original features, feature selection improves the interpretability of learn- 
ing models, which is preferred in many real applications, such as text mining 
and genetic analysis. Many frameworks have been proposed to unify the afore- 
mentioned feature extraction methods in various ways [148, 195, 198, 24, 176]. 
Comparing with these works, our multivariate formulations for spectral feature 
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selection form a bridge connecting feature selection with feature extraction, 
and allow us to take full advantage of both techniques. 

A data set with both very high dimensionality and huge numbers of sam- 
ples proposes a serious challenge to feature selection. In the next chapter, we 
study how to how handle very large-scale data sets in spectral feature selection 
via distributed parallel processing. 


Taylor & Francis 
Taylor & Francis Group 
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Chapter 5 


Large-Scale Spectral Feature Selection 


Continual advances in computer-based technologies have enabled researchers 
and engineers to collect data at an increasingly fast pace. Business and sci- 
entific data from many fields, such as finance, genomics, and physics, are 
often measured in gigabytes (GB, 2% bytes), terabytes (TB, 21? bytes), and 
sometimes even petabytes (PB, 215 bytes). For instance, it is reported that 
in 2010, one of eBay's data warehouses reached 10PB and will grow to 20PB 
in 2011. Other business operators, such as Bank of America, WalMart, and 
Dell also reported their data warehouses to be in a PB range. The enormous 
proliferation of large-scale data sets brings new challenges to data mining 
techniques. Scalability and efficiency are two critical issues in large-scale ap- 
plications [54, 31, 215, 66, 3, 139, 173, 193]. To address these challenges, 
existing data mining techniques need to be adapted and improved to handle 
large-scale data sets [16, 154, 101, 86, 194, 28, 61, 19]. 

Given a large-scale data set containing a huge number of samples and 
features, the scalability of a feature selection algorithm becomes extremely 
important [160, 65, 57]. Most existing feature selection algorithms are de- 
signed for handling data with a size under several gigabytes. Their efficiency 
may be significantly downgraded, if not totally unapplicable, when the data 
size reaches hundreds of gigabytes. Efficient distributed programming models 
and protocols, such as the Message Passing Interface (MPI) [163] and Google's 
MapReduce [1], have been proposed to facilitate programming on high perfor- 
mance computer grids [105] or clusters [10] to handle large-scale data prob- 
lems.! However, most existing feature selection techniques are designed to run 
in a centralized computing environment. For instance, it is assumed that all 
data can be held in the memory or, at least, all data are stored in one cen- 
tral storage space. Therefore, these algorithms cannot benefit from advanced 
distributed parallel computing techniques for improving efficiency and scala- 
bility. 

In this chapter, we show that the proposed spectral feature selection tech- 
nique can be conveniently extended to handle large-scale data via applying 
mature distributed parallel computing techniques such MPI or MapReduce. 
The key idea is data partition, as shown in Figure 5.1. To fit spectral feature 


1Both computer grid and computer cluster refer to a set of computers that are connected 
so that they can solve a problem together. The computer nodes of a grid are loosely coupled, 
while the computer nodes of a cluster are tightly coupled. Also the computer nodes in a 
cluster are homogenous, while the computer nodes in a grid can be heterogeneous. 
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FIGURE 5.1: Data partition for large-scale spectral feature selection. 


selection into a distributed computing environment, we first decompose the 
computation involved in feature evaluation into different types of summation 
computation over the samples. This allows us to distribute samples to the 
computer nodes of a grid (or cluster), so that different portions of the sum- 
mation computation can be done in parallel to generate local results. These 
local results can then be aggregated to obtain a global result. It can be shown 
that when the sample size is huge, this computing scheme can result in a linear 
speedup as the number of computer nodes used for computation increases. 

In the MPI framework, the computer nodes that generate the local re- 
sults are called slaves, and the computer node that controls the slaves and 
aggregates the local results is called the master. Similarly, in the MapReduce 
framework, the functions that generate the local results (run in parallel) are 
called mappers, and the functions that aggregate the local results are called 
reducers. In this chapter, we show how spectral feature selection can be im- 
plemented using MPI for distributed parallel processing. This technique can 
also be implemented using MapReduce in a similar way. 

In the following sections, we will first use linear regression as an example 
to illustrate how we can implement it in a distributed computing environment 
by partitioning data and distributing samples to computer nodes. We then 
show how to implement spectral feature selection in a distributed computing 
environment using MPI. We study both univariate and multivariate formu- 
lations for spectral feature selection. We also conduct complexity analysis to 
study the efficiency of the implementations. 
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5.1 Data Partitioning for Parallel Processing 


It is shown in [32] that many existing machine learning models, such as 
Naive Bayes, logistic regression, support vector machine, and k-means, can be 
parallelized in four steps: 


1. Decompose the training process into summation forms over samples. 
2. Partition data and storing data segments on nodes of the cluster. 

3. Compute local results in parallel on computer nodes of the cluster. 
4. Calculate the global result by aggregating the local results. 


Below we use standard linear regression as an example to show how this 
idea works. The objective of the standard linear regression is 


w* = arg min [ly — Xwl|, (5.1) 
w 
and its optimal solution is given by 


w* = (XTX) "Xy. (5.2) 


In the above equation, we assume that each row of X is an instance, X = 
IX. tza), To obtain the optimal solution, we need to compute X! X and 
X'y. Let A= X'X and b = X' y, and we can write A and b as two types 
of summation computation over the samples: 


A = O 
t= 


b = X'y=J yx (5.3) 
i=l 


As shown in Figure 5.2, assume our data are stored on three computer 
nodes (Slave 1, Slave 2, and Slave 3) of a cluster, and let P,, Pa, and P3 
denote the index sets containing the indices of the instances stored on the 
three slaves, respectively. Equation (5.3) suggests that we can compute three 
local results of A on the three slaves as 


Aj = oe j=1,2,3. (5.4) 
1G P; 
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FIGURE 5.2: Partitioning data for parallel processing. 


Since the computations of A;, Az, and Az are independent, they can 
be computed in parallel on each slave. After A1, Az, and Az are obtained, 
they can be sent to the master. We can compute A on the master by A = 
A; + As + As. Similarly, we can also compute b in parallel. After obtaining 
A and b, the optimal w* can be obtained by computing w* = A”! b. 

Let m be the number of features, n the number of samples, and p the 
number of slaves. Also denote (Xj, y¿) as the data set containing the samples 
on the j-th slave. The time complexities for computing Aj, bj, A, b, AT}, 
and A~'b are: 


1. Aj, Oceu (222) A; =X} X;, Xj ERE”. 


2. bj, Ocpu (22): b; = Xjyj;, Xj € REX, y; € RL, 


p 
3. A, Ocpu (m? log (p) ) + Ongr(m?log (p) ): A = X Aj, Aj e Rm. 
Ja 


P 
4. b, Ocpu(mlog (p)) + Oner (mlog (p)): b= © bj, bj € R”. 
j=1 
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6. A-!b, Ocpu (nm): AT! € R™™, be R™!, 


Here, Oc pu (-) denotes the computation cost spent on computing the solution. 
Oner (-) denotes the communication cost spent on transforming data from 
the slaves to the master through the network. Usually, the time consumed 
by one network operation is much longer than that used by one computation 
operation, time (Oner (1)) > time (Ocpu (1)). Therefore, in time complex- 
ity analysis, we count the number of operations required for computation 
and network transmission separately. For computing A and b, we need to 
send data from slaves to the master. Therefore, the time complexity for these 
two components contains both Ogpy (:) and Oy pr (-). Here we assume that 
a tree-based [63] distributed computation scheme, e.g., MPLREDUCE (see 
Section 5.2.0.5), is used for computing A and b. And we will explain why 
log (p) appears here in Section 5.2.0.5. Based on the above analysis, the time 
complexity of the parallel regression process is 


nm? 3 2 
Ocpu ae +m? | + Oxer (m7 log (p)), p=3. 


Assuming that n >> m, m > log (p), and the network is fast enough, we have 


nm? nm? 
Ocpu (= + m°) + Over (m? log (p)) = Ocpu ( 5 ) , p=3. 


This analysis shows that when sample size is large enough, the speedup of 
the regression process is linear. In the following sections, we show how similar 
processes can be used to parallelize spectral feature selection for handling 
large-scale data sets. 


5.2 MPI for Distributed Parallel Computing 


Before we try to parallelize spectral feature selection, we briefly intro- 
duce the message passing interface (MPI) standard.” MPI defines a standard 
interface for writing portable message-passing programs running on parallel 
machines. Figure 5.3 shows an example of how MPI works. Given a set of p 
computer nodes in the cluster, in most MPI applications, a set of p processes 
are created at the initialization step, and one process is created per com- 
puter node. These processes may execute different programs. Therefore, the 
MPI programming model is referred to as MPMD (Multiple Processes, Mul- 
tiple Data) model to distinguish it from the SPMD (Single Process, Multiple 


2www.mpi-forum.org. 


114 Spectral Feature Selection for Data Mining 


master 


Network Connection 


slave, 


FIGURE 5.3: Message passing interface (MPI) for distributed computing. 


Data) model. In the SPMD model, every computer node executes the same 
program. Therefore, compared with the SPMD model, the MPMD model has 
tremendous flexibility for handling complex problems. In an MPI-based appli- 
cation, one process may be assigned as the master (or root) and it coordinates 
other processes. The remaining processes are managed by the master and are 
called the slaves. In MPI, each computer node is associated with a number. 
The number is called the rank of the node. Usually the selected master node 
is numbered as rank = 0. 

MPI defines a set of functions to facilitate the communication among the 
processes running on different computer nodes [163]. Below we introduce three 
MPI commands that will be used frequently in the following sections. 


5.2.0.3 MPIBCAST 
MPLBCAST( buffer, count, datatype, root, comm ) 


INOUT buffer starting address of buffer 
IN count number of entries in buffer 
IN datatype data type of buffer 

IN root rank of broadcast root 

IN comm communicator 


MPLBCAST broadcasts the buffer of the process with its rank = root to 
all other processes. When communication needs to be made, all processes call 
MPI_BCAST. After return, the contents of root’s buffer are copied to all other 
processes” buffer. The first parameter of MPI BCAST, buffer, has different 
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roles for different processes. For the process with its rank = root, buffer is an 
input; for the processes with their rank Z root, buffer contains the output. 
Figure 5.4 shows how MPI_BCAST works. As shown in Figure 5.5, when the 
tree-based [63] distribution scheme is used for MPI Bcast implementation, 
to broadcast a vector of length n to p computer nodes, the communication 
complexity is Oy gr (n log» (p)). 


rank=0 P; Pa 
rank=1 P, | mpi-Bcast P, 
— 
rank=2 P; | Pz 
rank=3 P, PA 


P, buffer=1  MPI_BCAST(buffer, 1,INT,0,...)  buffer=1 
P,  buffer=0  MPI_BCAST(buffer, 1, INT, O,...)  buffer=1 
Pz  buffer=0  MPI_ BCAST(buffer, 1, INT, O, ...)  buffer=1 
P4 buffer=O  MPI_BCAST(buffer, 1, INT, O, ...) buffer=1 

BEFORE CALL MPI_BCAST AFTER 


FIGURE 5.4: MPI BCAST. 


5.2.0.4 MPI SCATTER 


MPI-SCATTER( sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm) 


IN sendbuf address of send buffer 

IN sendcount number of elements sent to each process 
IN sendtype data type of send buffer elements 

OUT recvbuf address of receive buffer 

IN recvcount number of elements in receive buffer 
IN recvtype data type of receive buffer elements 

IN root rank of sending process 

IN comm communicator 


MPI SCATTER splits the sendbuf of the process with rank = root into 
segments of length equals to sendcount, and sends each of the other processes 
(rank Æ root) a segment. In the communication process, all processes call 
MPI_SCATTER. For the process with its rank = root, sendbuf is the input, 
and its length equals p x sendcount. For the processes with their rank Æ root, 
recvbuf contains the output, and its length equals recvcount. Note sendcount 
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FIGURE 5.5: MPI BCAST broadcasts data to p nodes in log, (p) iterations. 


and recvcount should be equal. Figure 5.6 illustrates how MPI-SCATTER 
works. To scatter a vector of length n to p computer nodes, the communication 
complexity is Over (n). MPLSCATTERV can be used, when data sent to 
different nodes have different length. 


P, P, 
P, Ti TT] MPI-SCATTER P, 
Se 


Ps LITT I P; 
Pa [IIT] Pa 


FIGURE 5.6: MPLSCATTER. 
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5.2.0.5 MPI REDUCE 
MPI_REDUCE( sendbuf, recvbuf, count, datatype, op, root, comm) 


IN sendbuf address of send buffer 

OUT recvbuf address of receive bufler 

IN count number of elements in send buffer 
IN datatype data type of elements of send buffer 
IN op reduce operation 

IN root rank of root process 

IN comm communicator 


MPLREDUCE combines the elements provided in sendbuf of each pro- 
cess using the operation op and returns the combined value in the recvbuf of 
the process whose rank is equal to root. Commonly used op operations sup- 
ported in MPI include MPI MAX, MPI MIN, MPI SUM, MPI PROD (prod- 
uct), and some logical operations. In the communication process, all processes 
call MPI REDUCE. For the process with its rank = root, recvbuf contains the 
output and its length is equal to count. For the processes with their rank 4 
root, sendbuf is the input, and its length is also equal to count. MPLREDUCE 
applies element-wise operation on the inputs. For instance, assuming that 
there are three input vectors x1, X2, and x3, when op = MPI MAX, the el- 
ements of the output vector o are computed by o; = max (21 ;,%2;, 83). As 
shown in Figure 5.8, when the tree-based [63] distribution scheme is used for 
MPLREDUCE implementation, to reduce p vectors of length n, the commu- 
nication complexity is Oy gr (n log (p)) and the computation complexity is 
Ocpu (nlog (p)). 


e = a+b+c+d 

p, jal | | | P, lel | | | 
P, Lo] | | | MPI_REDUCE P, | | | 
[TIT] 

[TTT] 


P, |c op=MPI_SUM Py 
Pa [a] | [|] Pa 


FIGURE 5.7: MPLREDUCE. 
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SUM SUM SUM SUM 
iter 1 o_O eo e+——_o eo 
1 1 1 1 1 1 1 1 
iter 2 e (e) (e) e O (e) (e) 
2 2 2 2 
iter 3 e o o o O (e) O (e) 
4 4 
finish e O (e) O O O O O 
8 
P} P, P; P, P, Pg P, Pa 


FIGURE 5.8: MPLREDUCE (MPISUM) adds up data on p nodes in 
log, (p) iterations. 


5.3 Parallel Spectral Feature Selection 


In this section, we study how to implement spectral feature selection al- 
gorithms in a distributed computing environment using MPI. The key idea 
here is to decompose the feature evaluation process into multiple steps, such 
that the majority of the computation of each steps can be written as sum- 
mations over samples. This allows us to distribute samples to the computer 
nodes of a cluster, so that the computation strategy introduced in Section 5.1 
can be utilized for parallel processing. To this end, we first list the major 
computation steps involved in spectral feature selection for both univariate 
and multivariate formulations. 
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5.3.1 Computation Steps of Univariate Formulations 


Let S be the similarity matrix and f be a feature vector. According to Sec- 
tion 4.1, the major computation steps of univariate spectral feature selection 
formulations include: 


1. Compute the similarity matrix S. 
2. Normalize S and f to obtain Ŝ and f , respectively. 
3. Compute feature scores, f'Sf. 


In real-world applications, we usually use a sparse similarity matrix. In a 
similarity matrix, for an instance x;, we keep S;,; only if x; (or x;) is in its k 
nearest neighbors of x; or (xj). The reason is that when the sample size is huge, 
the size of a dense similarity matrix will be tremendous. Assuming that the 
sample size is n, the size of the corresponding dense similarity matrix will be 
4n (n + 1) bytes,? assuming the matrix is in double precision, and each double 
precision float number occupies 8 bytes. From Figure 5.9, we can observe that 
when sample size n is 2 million, the size of the corresponding dense similarity 
matrix is about 14,901.17 GB.* Note that if the dimensionality of the data 
set is less than 2 million, which is almost always the case in a real-world 
application, the size of a dense similarity matrix will be even bigger than the 
original data set. 


size (GB) Size of a Dense Symmetric Matrix 
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7.0E+05 
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1.5E+06 
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1.9E+06 
2.0E+06 


FIGURE 5.9: Size of a dense symmetric matrix as n increase. 


3The similarity matrix is symmetric, we only need to store half of the matrix. 
414901.17 = 8 x (2 x 10%) x (2 x 10% + 1) + (230 x 2). 
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In SPEC, for noise reduction, we apply spectral matrix function y (-) to 
adjust the eigenvalues of the normalized similarity matrix (or the correspond- 
ing normalized Laplacian matrix) to penalize the high-frequency components? 
(see Section 2.4). It turns out that for a data set of very large scale, applying 
such an operation may be very expensive. First, the time complexity of com- 
puting a matrix-matrix multiplication is O (nè). When n is very large, this 
operation can be very expensive. Second, although our similarity matrix S is 
sparse, SP can be dense. And this matrix may be too big to store. Alterna- 
tively, we can achieve noise reduction by only using the leading eigenvectors of 
the similarity matrix. When the similarity matrix is sparse, using the Lanczos 
method [142, 38] to calculate a few eigenpairs only requires roughly O (n?) 
operations, and the computation can also be parallelized. 


5.3.2 Computation Steps of Multivariate 
Formulations 


For spectral feature selection with a multivariate formulation, the major 
computation steps are different for the supervised and the unsupervised cases, 
so we discuss them separately. 


e Unsupervised Case: 


1. Construct the similarity matrix S. 
2. Compute its top k eigenpairs and the target matrix Y. 
3. Solve the Lz; regularized multiple output regression problem 


arg min [Y — XW | +A IW] 
st. A={i: wėl > 0}, Card (A) = l. 
e Supervised Case: 


1. Compute the target matrix Y. 
2. Solve the Lz, regularized multiple output regression problem 


arg min [Y — XW | +A]WI|., 
st. A={i: IAR > 0}, Card(A) = l. 


Similar to the univariate formulation case, we use a sparse similarity matrix 
in multivariate spectral feature selection for unsupervised learning. 

In both univariate and multivariate spectral features, we need to compute 
the similarity matrix. In the next section, we study how to efficiently compute 
a sparse similarity matrix for a large-scale data set using a computer cluster. 


5For normalized similarity matrix, the high-frequency components correspond to those 
eigenvectors having small eigenvalues, and for normalized Laplacian matrix, they correspond 
to the eigenvectors with large eigenvalues. 
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5.4 Computing the Similarity Matrix in Parallel 


To construct a sparse similarity matrix S, we may retain S; ¿ only if x; 
(or x;) is among the k-nearest neighbors of x; (or x;). Typically k is a small 
number (k & n). 

Suppose p computer nodes are allocated in a computer cluster for parallel 
spectral feature selection, and our data are partitioned in p parts and stored 
distributedly in the p computer nodes. As shown in Figure 5.10(a), we want 
to construct S in parallel and store a portion of it on each computer node. 
The advantages of this strategy are twofold. First, by splitting S into many 
parts and storing them distributedly, we can put the entire S in memory. 
This allows us to do computation efficiently without accessing the hard disk. 
Second, by storing S distributedly, we show that the computation related to 
constructing S and evaluating features can all be done in parallel efficiently. 


5.4.1 Computing the Sample Similarity 
Assume that we compute sample similarity using an RBF kernel defined 
as 
K (Xi, xj) = exp (|x: — xjl?) : 
Figure 5.10(b) illustrates how the first column of S can be computed in parallel 


on the p computer nodes of the cluster. The process contains the following 
two steps: 
1 Node; broadcasts (MPI_Bcast) x; and ||x;||? to all the other computer 


nodes. 


2 Node, t = 1,...,p, computes similarity between x; and the samples on 
the node; 


K (x1,Xi) = exp (|[x1 — x; ||?) „i € Partition, t=1,...,p. 


Since |x; — x; ||? = |x|? + [Ix;[1? — 2x] xj, we can accelerate the dis- 
tance computation by precomputing ||x,||? for all samples, and caching them 
on the computer nodes. In this process, broadcasting x; and ||x;||? requires 
Oner ((m + 1) log (p)) for network transmission,® and the computation re- 


quires Oc py (m2) operations. By repeating this process for all samples, we 
can obtain the complete similarity matrix S. The total cost is 


ONET ((m + 1) n log (p)) + Ocpu (n=) A (5.5) 


SHere, we assume that the tree-based distribution scheme is used for MPI_Bcast imple- 
mentation [63]. 
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FIGURE 5.10: The similarity matrix S is computed and stored in a dis- 
tributed manner. 


5.4.2 Inducing Sparsity 


To construct a sparse similarity matrix efficiently, we can associate each 
row with a priority queue [96], which helps track the k largest elements in 
the row. Assuming that a binary heap is used for implementing the priority 
queue, the time complexity for obtaining the smallest element in the queue is 
O(1), and for insertion and deletion are both O (log (1)), where / is the size of 
the queue. Considering the computation involved in queue operations, so far, 
the cost for constructing S is 


ONET ((m + 1) n log (p)) + Ocru ((m + oe (k)) 5 ; (5.6) 


5.4.3 Enforcing Symmetry 


S should be symmetric. However, the S we obtained so far is not. To ensure 
that S is symmetric, we need to go through the following two steps: 


1. Construct a list for each row of S that contains the k nonzero elements 
of the row. In the list, the elements are ordered by their column index 
in ascending order. 


2. For each row of S, S,., comparing it with all other rows, S;., j Æ i. 
Update the elements in S,. and $,. by S;,; = Sj; = max (S; j, Sji). 


The first step of the above process forms a list of lists (LIL) representation 
of the sparse similarity matrix S [137]. The computation for converting the 
content of the priority queues can be finished in Oc py (2k log (1)) operations. 


The second step makes S symmetric, e.g., if Sj; = 0 and S;,; 4 0, then 
Sj; = max (S; ¿, 855) = S; j. The time complexity of this step is 


Over ( {klog (p) + Dn) + Ocpu (=) : 
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where | is the maximal size of the neighborhood of a sample. Note that | > k, 
since for each row some of the zero elements may be set to nonzero after 
the symmetry enforcement process. Figure 5.11 shows how we can make S 
symmetric by updating its rows. 


EANN S. ' i tii 
tt Se 5 
i ti Lig : 
— DA Sie 
tj; >i > Sji =0 tn 
(a) (b) 


FIGURE 5.11: Update rows to make S symmetric. 


First, as shown in Figure 5.11(a), for row S;,., we only send its section 
Si (i+1:n) to the rows $;41,:,...,Sn,, for comparison. The reason is that S; (1:4) 
and S(:), have already been updated previously. Second, in the comparison 
process, we always maintain a pointer for each list, which points to the position 
t of the row. Here t is the smallest column index corresponding to a nonzero 
element of the row and is larger than 7 — 1, e.g., S;4 > 0,t > 1— 1. These 
pointers help us find $,;,; in O(1) operation. When S; j 4 0 and S;; = 0, we 
update the row S;. by inserting S; ; in S;. as S;;. Similarly, if S; ; = 0 and 
Sii #0, we update S;. by inserting S; in S,. as S; j. After updating S, ;, 
for each row, if the t; is equal to 2, t; = i, then the pointer is updated by 
being pointed to the next nonzero element in the list. Each computer node 
also sends its updates for S; (¿+1:n) back to the computer node that holds $;, 
to update the original S;.. 


In On er ((k log (p) + t) n)+Ocpu (=), Over (nk log (p)) corresponds to 


the network transmission for broadcasting S$,,(;+1,n)- ONET (nt) corresponds 
to the network transmission for sending the updated for S;,, back to the com- 


puter node that contains S, (via MPI_Scatter). And Ocpu (=) corresponds 


to the computation required for updating the rows of S. Considering the cost 
of making S symmetric, the total cost of constructing a sparse symmetric 
similarity matrix S is 


Over ( (+1) log (p) + 8m) + Ocpu (n+) 7). 67 
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After obtaining the similarity matrix S, we may also need to extract the 
top k eigenpairs of S. This can be achieved by applying the implicit restarting 
Lanczos method [77]. The mechanism of this algorithm is beyond the scope of 
this book, and we refer readers to [38, 77] for details. It can be shown when 
p computer nodes are used for computation, the time complexity of a parallel 
implicitly restarted Lanczos method is 


(Ocro (e + (= + z) x (u— 1) 
+Oner {n (u — k) log (p)} \ x iterations, (5.8) 


where u is the total number of Lanczos vectors (Arnoldi length, u > k), t 
is the maximal neighborhood size of the samples, and iteration is the total 
number of restart iterations. Existing toolboxes, such as PARPACK” and 
ScaLAPACK,® can also be used to compute the eigenpairs of a symmetric 
matrix efficiently in parallel. 


5.5 Parallelization of the Univariate Formulations 


As discussed in Section 5.3.1, after obtaining the similarity matrix S, there 
are two steps for spectral feature selection with a univariate formulation, in- 
cluding (1) normalizing S and f to obtain $ and f, and (2) computing feature 
scores via f'S f. It turns out that in a distributed computing environment 
it is usually beneficial to combine the two steps, since this will reduce the 
unnecessary network communication among computer nodes. Below we use 
the second feature evaluation criterion in SPEC, (pa (-) as an example to show 
how a univariate spectral feature selection formulation can be parallelized. 

Let D be the degree matrix of S, and £ = D~? (D—S)D~? be the 

1 


1 2 1 
normalized Laplacian matrix. If €, = Pit and f = Pi£. we have 
11D21|| IID2£|| 
fTL Ê 
Oath) = — 
1- (Ea) 


TPARPACK: http://www.caam.rice.edu/~kristyn/parpack_home.html. 
8ScaLAPACK: http://www.netlib.org/scalapack/. 
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Plugging £, f, and & in the above equation, we obtain 


i= £ Sf 
pa(F) = f' Df 3 
1 (mx) 
_ ED) (fs) -1D (Fst) a 
7 (17 D1) (£ Df) — (f7D1)? a 


The equation shows that to compute p2(F), we need to obtain five in- 
termediate results, including D, f! Df, 1'D1, f' D1, and f'S f. The first 
four intermediate results can be written directly into summation forms over 
samples and computed in parallel: 


d; = 5 Si j (5.10) 
j=1 

f Df = Ds di = y N fas (5.11) 
E 

1' SI (5.12) 
j=11€P; 

n p 

dl e A RD Si =D fidi (5.13) 

i=1 j=1 i€Pj 


In the above equations, d; is the i-th diagonal element of the degree matrix 
D, and f; is the i-th element of the feature vector f. Assuming that the data 
set is distributedly stored on p slaves, Figure 5.12 shows that we can obtain 
the four intermediate results in three steps: 


1. Compute d; on each slave using Equation (5.10) in parallel. 


2. Compute Y) f? di, X di, and Ð` fi di, j =1,...,p on each slave in 
ie P; ie P; 1EP; 
parallel. 


3. Send local results from the p slaves to the master, and the master com- 
puting 1'D 1, 1'D f, and f'D f. 


In the first two steps, there is no communication between the slaves and the 
master. In the third step, the slaves need to send out their local results, and 
the communication cost is Oy gr (3p) for one feature. Therefore, the total 
computation cost of the three steps is Ocpy (n? + 65 + 3log (»)) for each 


feature. Note that in the process of computing feature scores for all the m 
features, the intermediate results for d; and 1' D 1 just need to be computed 
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(1) | (2) | (3) 
FIGURE 5.12: Parallel univariate spectral feature selection. 


once. Therefore, to compute the first four intermediate results, D, f' Df, 
1'D1, and f'D1, for all m features, the total cost is: 


Oner (2mp + p) + Ocpu (e + + 5m + 2m log (p) + log 0) 
= Oner (mp) + Ocpu (e + m, + m log 0) . (5.14) 


It turns out that computing f'S f is more complex than computing the 
other four intermediate results. This is because to compute f' S f, a feature 
needs to see all the elements in S, which are distributedly stored in the p 
slaves. Figure 5.13 shows an efficient way for computing f'S f. Since the data 
set is distributedly stored in p slaves, a feature f is partitioned to p segments, 
and each slave holds a portion of the feature. Let f = (£,.. GEN and 
we assume that f; is stored in the i-th slave. The similarity matrix S is also 


distributedly stored in p slaves, and each slave holds E rows of S. We denote 


S = (Sj... a and assume that Sj € R?*” is stored in the j-th slave. 
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Based on these notations, we can decompose the computation of f'S f as 


Sı 
O 2, E: 


p 
J 
= > i S; f 
j=1 


fi 
si ea S 
fp 
LS (£"S) fj. (5.15) 


j=1 


In the above equations, f!S € R!*” is an n-dimensional row vector, and 


n 


(f'S) € R'*> is the j-th segment of f'S. f'S is computed on the master 


j 


using the local results obtained by the slaves. And (f'S), is sent from the 


master to the j-th slave for computing (8), f;. Figure 5.13 shows how to 


compute f! Sf in five steps. 


1. 


p 
. Slaves send f] S; to the master and compute f'S = © ff S; 


] (£'S), fj are computed by the slaves, and the cost is Ocpy ( | 


Compute £] S; on the j-th slave. Here ES; € RIX” is a m-dimensional 


row vector. The cost of this step is Oo py (nz) 3 


(via MPI Reduce). The cost of this step is Ocpy (nlog(p)) + 
Oner (nlog (p)). 


. The master sends f! S; to the j-th slave (via MPL Scatter). The cost is 


ONET (n). 


w 
P 


. Slaves send the obtained (£TS) f; to the master to compute 


j 
p 
y (25), f; (via MPI Reduce). The cost of this step is Ocpu (log (p))+ 
j=1 


Oner (log (p)). 
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(1) on slaves (2) on slaves (3) on the master 
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(5) on the master (4) on slaves 


FIGURE 5.13: Parallel univariate spectral feature selection. 


When m features are evaluated, the total cost is: 
Oner (mnlog (p) + mn + mlog (p)) 


2 
+ Ocru (== + mn log (p) + = + m log 0) 
mn? 
= Oner (mnlog (p)) + Ocpu a + mnlog (p) } . (5.16) 


In Chapter 3, we study the multivariate formulations for spectral feature 
selection. In the next two sections, we show how to parallelize the multivariate 
formulations for large-scale spectral feature selection. 


5.6 Parallel MRSF 


Presented in Section 3.4, MRSF is a multivariate spectral feature selection 
framework based on sparse multiple output regression. Its key component is 
a multi-output regression with a sparse regularization on the weight matrix 
W, which has the form 


a 2 
arg min [Y — XW e +A |W] 
st. A= {i:||w'||, > 0), Card (A) =1. (5.17) 


As we show in Sections 3.6 and 4.2, by defining Y in different ways, we can 
achieve different types of multivariate spectral feature selection. The problem 
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defined in Equation (5.17) is a convex problem. In Section 3.4, we present 
an efficient solver, MRSF, for solving the problem, and the pseudo-code of 
the algorithm can be found below in Algorithm 7. It turns out that the major 
computational load of the algorithm comes from four key steps, which include: 


1. Initialize the active set (Line 2). 
2. Compute a tentative solution (Lines 4-10). 
3. Obtain an optimal solution based on the current active set (Lines 11-12). 


4. Check the global optimality of the solution obtained in Step 3 (Lines 
13-18). 


Algorithm 7: MRSF: Minimal redundance spectral feature selection 
(see Section 3.4) 
Input: X, Y, / 
Output: W 
WU =0, A =+00, i= 1 and RU! = Y; 
Compute the initial active set: Ay = arg max; ||f; RI ||ĝ; 
while i < | do 
Compute the walking direction ya,: Ya, = (Xa) ` Xj RU; 
for each j ¢ A; and an arbitrary t € A; do 
Compute the step size a; in direction ya, for fj to enter Aj. 
| IE] (RU — a7Xa,a,) lle = (1 a) lt ROI»; 
7 j* = arg minjga, Qj; 
A R T ES 
8 W = ((w" + 0j=YA,) .0) : 
9 | Å= AU), A =(1-03=)1£ RU; 
10 Solve the smaller optimization problem, 


OU A O N = 


ming, [Y — Xa W||? + A;||W]l2,1, using a general solver with W as 
the starting point; 

11 R = Y — XW; 

12 | if Vigă, |E Ro < A; then 


13 Ari = Á, with = W, Rett = R, 1=1+ T; 
14 else 
15 Â = (i: ISI #0} U (y + IE Bl > di}; 
16 Remove w’ from W, if ||| = 0, 
E E T 
W = (W”,0,...,0) , Goto line 11; 


= 


7 Extend WU to W by adding empty rows to Wl); 
8 return WH: 


= 
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Below we study how to parallelize these key steps to improve the scalability 
and the efficiency of the original MRSF solver. 


5.6.1 Initializing the Active Set 
We initialize the active set by solving the problem 
Ay = arg max; ||f] RUZ, RO! = Y. (5.18) 


Assuming that X and Y are partitioned and stored in p slaves, we have 
f = (£1,...,£1)', and Y = (YT,..., Y7)", where f; and Y; are the i-th 
partition of f and Y stored in the i-th slave, respectively. Using the above 
notations, we can rewrite f| Y as 


p 
PY=) f Y, (5.19) 


» fY D iv 


on slaves on the master 
FIGURE 5.14: Initializing the active set. 


As illustrated in Figure 5.14, to compute ||f!Y||, we first compute f; Y; 
on each slave in parallel, which results in p row vectors of C dimension. 
Here C is the number of columns of Y. The time complexity of this step 


is Ocpu (cz). We then aggregate f] Y;, j = 1,...,p to the master and 
obtain f' Y = Ya f; Y; (via MPI Reduce). The time complexity of this 
step is Ocpu (Clog (p)) + Oner (C log (p)). Finally, we compute ||£' Y|| on 


the master, and the time complexity is Oc py (C). We need to go through this 
process for all m features. Therefore, the total time complexity is 


Ocpu (me (= + log 0) ZA m) + ONET (motos (p) ) (5.20) 


where the additional CPU(m) corresponds to the cost of finding the fea- 


ture with the maximal ||£ Y || value. When pe? log(p), and Ocpu (=) > 
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Oner( log (p) ), we have 


Ocpu (me (= + log 0) ar m) + Over (me log w) ) 


~ Ocpru (me (2) (5.21) 


In this case, by using multiple slaves to construct the initial active set, we 
can achieve a linear speedup in terms of p. 


5.6.2 Computing the Tentative Solution 


After obtaining the active set of the ¿-th step, we need to compute a 
tentative solution based on the active set. The pseudo-code of this process is 
shown in Figure 5.15. It consists of three steps: 


1. Compute the walking direction ya,. 
2. Calculate the step size corresponding to each feature. 


3. Construct the tentative solution based on the feature with the shortest 
step size. 


Among the three substeps, the third substep can be computed on the 
master, and the other two can be computed on the slaves in parallel. Below, 
we show how the three substeps can be efficiently computed in detail. 


4 | Compute the walking direction ya,: Ya, = (XA Xa) X4, R”; 


5 | for each j ¢ A; and an arbitrary t € A; do 

6 Compute the step size a, in direction ya, for f; to enter Aj. 
E (RU a aXXa; Ja; ) llo == (1 ES o) ROI»; 

7 | end 


8 | J“ = arg minjga, Qj; 
Ă | T 
E, Mé ((wi iai 07-721)" ,0) 


10 | Â= AU), A = (1 — a4) 


E RII; 


FIGURE 5.15: Computing the tentative solution as in Algorithm 7. 


5.6.2.1 Computing the Walking Direction 


In the first step of computing the tentative solution, we need to compute 
the walking direction ya,, where ya, € R**© is the solution to the multi- 
output regression Xa, ya, = R. It can be computed in a similar way as in 
Section 5.1. In the i-th step, there are i features in the active set. There- 
fore, Xi, XA, is an į xi matrix. The time complexity for computing Xi Xa, 
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is Ocpu (72 +77 log (»)) + ONET (i? log (p)). Similarly, we can show that 
the time complexity for computing X/ R" is Oopu (ckz + Ck log (»)) + 
On pr (Ck log (p)), where C is the number of columns of RI. After obtaining 
X[ Xa, and X] RM, (X] Xa) ` XI, R" can be computed on the master, 
and the time complexity is Oc py (3 + k?C), In the above process, for Xi, Xa, 
and X} RM, p local results are computed on the slaves, and the global result 
is obtained on the master by aggregating the local results (via MPI_Reduce). 


(X4, Xa)” X], R" is also computed on the master. Assuming * > C log (p), 
and > > 1, the total time complexity of this step is 


Oa (6 + Ci) z) + Over( (i? + Ci) log (p) ). (5.22) 


The above equation shows that when 7 is small, and the network is fast enough, 


the time complexity is dominated by Ocpy ((? + Ci) z). In this case, the 
speedup ratio of the computation is linear in terms of p. 


5.6.2.2 Calculating the Step Size 


In the second step of computing the tentative solution, we need to compute 
the step size a; corresponding to each feature fj. a; defines the minimal length 
we need to proceed in the direction of ya, until f; enters the active set.” To 
compute a, we need to solve the equation 


If (RM — a;Xa,a;) lle = (1 — a IT RM. 


As shown in Figure 5.16, aj can be computed in four substeps. 

In the first substep, we compute three row vectors: f RU, f] Xa, 
and f/ R, where f} RË, f RË e RI*C, and ff XA, € R!*i. Each 
of the three vectors can be computed in parallel in a similar way as 
we show in Section 5.6.1. The time complexity of computing E R! and 


ET Xa, is Ocpu ((C +i) 2 + (C +i) log (p)) + Over ((C + i) log (p)). Sim- 
ilarly, the time complexity of computing f, RI is Ocpu (cz + Clog (»)) + 
Oner (Clog (p)). After the first step, £} R”, £7 XA,, and f; RI are all ob- 
tained on the master. 

In the second substep, we compute £ Xa; Ya, on the master, which requires 
Ocpu (iC) operations, and results in a C dimensional vector. 


In the third substep, we compute ||£, R!!||. The computation is done by 
the master. And the time complexity is Ocpy (C). 


9 A feature enters the active set when it has the same amount of correlation to the residual 
as the features in the active set. 
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FIRII o = jfr] 
T lil 
fR a aj 
fi XA = f XaVa; 2 b 
slaves ») master master master master 


(1) (2) (3) (4) 


FIGURE 5.16: Computing the step size a; in four substeps. The step 1 ac- 
counts for the major computational load for calculating a, and is parallelized. 


In the fourth substep, we compute a;. Let c = ||f/ RI, a! = £R", and 
bl = £} Xa, Yai aj; can be obtained by solving the problem 


za? — 2ya + z = 0, 0<a<l, 
where a=b'b-e?, 


y=alb-e?, 
z=ala—c?, (5.23) 


The time complexity of solving this problem is Ocpu (C). 
If we assume that £ > C log (p), > > i, and m > i, the time complexity 
of computing the step size for all m — i features is 


Ocpu (m (C +i) z) + Over(m (C + i) log (»)). (5.24) 


Note that c in Equation (5.23) can be shared by all features, and just needs 
to be computed once. 

The above analysis shows that the substep 1 accounts for the major com- 
putational load for calculating a;. Since the majority of the computation is 
done by the slaves, when i is small and the network is fast enough, we can 
achieve linear speedup in terms of p. 


5.6.2.3 Constructing the Tentative Solution 


In the third step of computing the tentative solution, we need to update 
the weight matrix W, the active set, and A. The computation is done on the 
master and the time complexity is Ocpy (log (m) + iC). 
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5.6.2.4 Time Complexity for Computing a Tentative Solution 


Assuming that m > i, based on the above analysis, the total time com- 
plexity for computing a tentative solution according to A; is 


Ocpu (m (C +i) z) + ONET (m (C + i) log (p) ) (5.25) 


It shows that when m > i, the computation in the second step (calculating 
the step size) dominates the computational load. And when n is large enough, 
increasing the number of slaves leads to linear speedup in terms of p. 


5.6.3 Computing the Optimal Solution 


After obtaining the tentative solution, we need to compute a solution that 
is optimal on the current active set with the updated regularization parameter 
A. This corresponds to solving a smaller optimization problem defined as 


minyy [Y — X ¿WI? +A Wa. (5.26) 


As shown in Section 3.3, this problem can be solved efficiently by either the co- 
ordinate gradient descent method or the accelerated gradient descent method. 
In the following, we use the accelerated gradient descent method as an exam- 
ple to illustrate how the method can be parallelized for solving the problem 
specified in Equation (5.26). 

Algorithm 8 contains the pseudo-code of the accelerated gradient descent 
method. It defines an iterative process to compute the optimal solution of 
Equation (5.26). In each iteration, three steps are taken to compute W,, the 
weight matrix of the j-th iteration: 


1. Compute 8; and Sj. 
2. Calculate the minimizer of the model function, Mg,,, (W). 
3. Update L; and Qj+1- 


Since Wo is stored on the master and is small, we can compute Steps 1 
and 3 on the master. Below we show how the second step can be decomposed 
and accomplished in parallel. 

In the second step, we need to compute the minimizer of the model function 
Ms, 1 (W). As shown in Section 3.3, this boils down to computing the matrix 
V, which is defined as 


1 
V = 8; — q XA (XS; — Y). (5.27) 
J 


And the weight matrix W can be derived from V using the equation 


i v|1-——)], when |lv'|lo > 
-f (ón) a, (5.28) 
0, when ||v’||2 <p 


z 
| 
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Algorithm 8: Accelerated gradient descent method (see Section 3.3.2) 
Input: XA, Y, A, Wo, Lo > 0, M 
Output: War+1 
Wi; = Wo, Qi 0, Qo 1, L Lo; 


By = 222, Sy = Bj (Wy — Wj-1); 
for L = Lj 15 2L; 13 AL; lo... do 

W541 = arg min Ms, (W); 

if $Y — Xa Wi+11l% +A [Wi+all, < Ms, 2 (W541) then 
7 | break; 


NBA O Ne 


1+4/1+4a? 
8 M L; = L, Qj+1 = i Sel ; 


9 return Wyr+1; 


In the above equation, p is defined as: p = A. The equations show that the 
major computational load in the second step comes from computing V. Since 
S, V, and W are all of small size, once we obtain V, we can easily obtain W 
on the master. Below, we demonstrate how to compute V in parallel. 

To compute V, we need to compute two components including X} XA € 
RX? and X,Y € R'*C. Assuming that Xp and Y are both partitioned 
into p parts, Xa, Yi, i = 1,...,p, and each part is stored on a slave, the 
computation of XIXa and X.Y can be decomposed and done in parallel as 


P 

XiXa = Y Xi Xai (5.29) 
i=1 
P cedo 

XY = X XL Yo (5.30) 


The decomposition suggests that we can compute XI Xai € Ri*? and 
X.Y; e RC on the i-th slave and aggregate the local results on the mas- 
ter (via MPI Reduce) to obtain X} XA and X} Y. The time complexity of 
computing X/ XA and X,Y is 


Ocpu G + iC) a + (i? + iC) log 0) + Over ((i? + iC) log (p)) . (5.31) 


After obtaining XA XA and xT Y, V can be computed on the master as 
1 1 
V = 8; — p Xa (XS; — Y) = Sj — El (XA XS; — XA Y). (5.32) 
j j 


And the corresponding time complexity is Oc pu (120). Based on V, we can 
obtain W using Equation (5.28), and the time complexity is Ogpy (iC). 
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After obtaining W +, we need to check whether the current L is valid by 
requiring 


1 
5 [Y — XA Wall +A (Witla; < Ms; (W541). 


We can show that 


1 
5 [Y Wall +A IWsallo (5.33) 
1 1 
= z Trace (Y Y) + z Trace (Wa Xa XA Wa) 
— Trace (Wj, X,Y) + A ja za 
and 
1 
Ms, a (W) = 3lXaS; - Vll + Trace ((X48; — Y)” Xa (S; — W)) 


L; 
+A|Wisillaa + #2 Trace ((8; — W541)" (S; — Wy+1)) 
J Xa XaS;) — Trace (S} XA Y) 
+Trace (S} X{ Xa (S; — W)) — Trace (Y' Xa (S; — W)) 


= Í Trace (YY) + 5 Trace (S 


Ls 
PAI ra la + “A Trace ((S;— Wii)” (Sj — Win). 


The above two equations show that in the validation process, except Y ' Y, 
all the computations involving Y and XA are in the form of XA XA or Xi Y, 
which have already been computed on the master in the last step. Since Y ' Y 
appear in both 4 ||Y — XA Wu +A Wj+1ll, and Ms, a (W), and can 
be canceled, we do not need to compute it in the validation process. The time 
complexity of the validation step is Ocpu (i?C + iC?). 

After the validation step, we need to send W = W m+1 back to the slaves 
(via MPI_Bcast) for computing R;, i =1,...,p, which is the residue corre- 
sponding to W, and will be used in the next step. Since R; = Y, - XW, 
the time complexity for computing it is 


n 
Ocpu (ic) + Over (ic log (p) ) . (5.34) 
In summary, assuming that 4 > log (p), the total time complexity of 
computing the optimal solution based on the current active set is 


n 

Ocpu (mu (i? + iC) z) + ONET (mu (a + iC) log (p) i (5.35) 
where M is the maximal number of iterations specified in Algorithm 8, and 
lz is the averaged number of tries for searching the proper L in the validation 
process. The analysis shows that when 7 and C are small and p is large, we 
can obtain near linear speedup in terms of p. 
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5.6.4 Checking the Global Optimality 


Finally, we need to check whether the obtained W is globally optimal. The 
pseudo-code of this step is presented in Figure 5.17. 


13 if Vigă, JIE Rilo < < A; then 


14 | Au =A, WHI = W, RHI = Ř, i= i+ 1; 
15 else 
ro | A LOU: IE RI > as} 
z 
17 Remove w' from W, if ||w'|| = 0, W = (wr, 0,.. 0) , Goto line 11; 
18 end 


FIGURE 5.17: Checking the global optimality. 


In this step, we need to compute the correlation between each unselected 
feature f; and the current residual R, then check whether the correlation is 
larger than A. Assuming there are features with correlations larger than A, we 
want to pick the feature with the rece correlation and add it to the active 
set. In this process, computing f; R is the most expensive part. Its compu- 
tation can be decomposed and dongi in parallel as described in Section 5.6.1. 
We can show that the time complexity of this step is 


seir G (m1) +C (m—i)log (0) +Oner(C (m- i)1og (p)). (5.36) 


When 5 > log (p), this can be simplified to 
Ocpu (cima) + Over(C(m — i) log (p) ). (5.37) 


5.6.5 Summary 


Tables 5.1 and 5.2 summarize the time complexity of the four key steps 
listed on page 129, and the computations that can be parallelized in these 
steps, respectively. In the analysis, we assume that p m © l,C,p. 

The first step is performed just once at the very beginning. The second 
step will be done / times, where / is the number of selected features. The third 
and the fourth steps will be done [x ly times, where ly is the averaged number 
of backtraces for adjusting the active set Á, when the solution obtained on Á 
is not globally optimal. When these factors are considered, we can obtain the 
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TABLE 5.1: The time complexity of each step. 
Step Time Complexity 


1 Ocru (mcz) + ONET (mc log (p) ) 
2 Ocpu (m (C +i) 


) + Ower(m( 
3 Ocru (mu (? + iC) 2) + Over (Mt (i 2 + iC) log (p )) 
) + Oner(C( 


+ Oner|m(C +1) log ( »)) 


4 Ocru (C(m-i) 1) + Onegr C (m m—i ) log ( »)) 


TABLE 5.2: Parallel computation involved in each step. 


Step | Parallel Computation 

1 | £ Y 

2 | XíXa, X(RÚ, IR, XA, £ Ri" 
3 | X Xi, X|Y, Y -X¡¿W 

4 | FR 


J 


total time complexity of parallel MRSF as 


l l 
Ocpu (ne + Som (C +i) LY X Mizlv (i? + iC) ; 


= t=1 
+ Wve (m-i) z) 


S pe (mc +m (C+) + Mily? (+ 0)" + Clyl(m -0?) 
Pp Pp p p 


II 


Ocpu (u (Cly +1) + Mlzlv?? (l+ C) ) z) | 


Large-Scale Spectral Feature Selection 139 


l l 
ONET ((ne+ Emcro+ Sac (i? + iC) 


= i=1 
+ > lyC (m — D) log 0) 


Opie (ue +mi(C+0)+ Mil (14 C) + Clyl(m—1) ) log (p) ) 


Deba (u (Cly +1) + Mirly? (1+ C) ) log (p) ) | 
Total time complexity 

2 n 
Desi (mi (Cly +) + Miriy? (+0) ) 


+ ONET (C (Cly +0) + Mlzlyl? (L+ C) ) log (p) ) . (5.38) 


The decription of the variables in Equation (5.38) can be found in Ta- 
ble 5.3. 


TABLE 5.3: Description of the variables in Equation (5.38). 


Variable Description 
m number of features 
n number of samples 
p number of slaves 
l number of selected features 
C number of columns of Y 
M number of accelerate gradient iterations 
ly number of backtrace for adjusting Á 
lt number of iterations for searching L 


5.7 Parallel MCSF 


Presented in Section 3.5, MCSF is a multivariate spectral feature selection 
framework based on matrix comparison. In MCSF, we want to select a set of 
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k features such that the linear kernel constructed on the features is close to 
the given sample similarity matrix S, 


min |S — X 4X}, (5.39) 
where Xa = A 19 JA ij EA, Jim Arrera ki 


Assuming that features have been normalized to have unit norm, we 
showed in Section 3.5 that MCSF needs to run k iterations to select k features. 
And in each iteration, it selects a feature using the criterion 


argmaxyga, fj Rifj, (5.40) 


where R; =S- X AXA, is the residual matrix in the i-th iteration. 

It turns out that for a large-scale problem, we cannot use Equation (5.40) 
to select features. The reason is that the residual matrix R; may not be sparse. 
When the sample size is large, we may not be able to store it, even on the 
hard disk. To address this problem we can rewrite the computation specified 
in Equation (5.40) as 


arg maxjga,; f Rif, = argmax;¢a, f (S — Xa Xa,) fj 
= argmax;ga, (£; Sf; — fj Xa, XA, fj). (5.41) 


In the first iteration, A; = @, and we select the first feature by using the 
criterion 


poa. 


mi Si (5.42) 


Assume that i features have been selected. Let the i-th selected feature be 
fl], and the score of the j-th feature f; in the i-th iteration be sc; (j). We can 
show that the score of f, in the (i + 1)-th iteration is 


sciali) = ff (S-XaXa,)£; 
= E (S-Xa XK, — fle’) f 


= ff (S-Xa_. O 5- g eels, 


= sei(j) E fe f. 
Therefore, given sc;(j), sci+1(j) can be computed as 
sc: (j) = sei(j) — ET Fle q, (5.43) 


Equation (5.43) shows that after selecting the first feature, in each of the 
PEE 
remaining iterations, we can update feature scores by subtracting f7 f/f! f; 
from their current score. We then select the feature with the largest score, and 
continue with the next iteration. 


Large-Scale Spectral Feature Selection 141 


As shown in Section 5.5, £ Sf; can be computed in parallel, and the time 
complexity is 


2 


Ocpu (= + mn log 0) + Onrr (mn log (p)). (5.44) 


Similarly, after the first step, we can parallelize the computation of each iter- 
ation specified in Equation (5.43), which has the time complexity of 


desp (= iag 0) ON (5.45) 


Therefore, for selecting l + 1 features, the total time complexity is 


Ocpu (a+) (Z2 mios (0)) ) +Oxer( (n + Dmtezto) ). (5:46) 


The above equation shows that when n > l, the major computational load 
of parallel MCSF comes from the first iteration of the algorithm. Also, when 
A > log (p), and the network is fast enough, we can obtain near linear speedup 
as the number of computer nodes p increases. 


5.8 Discussions 


In this chapter, we discuss how different univariate and multivariate for- 
mulations for spectral feature selection can be implemented in a distributed 
computing environment. The key is to decompose the computation involved 
in these formulations into various summation forms over the samples. The 
decomposition allows us to parallelize the computation by computing local 
results in parallel on the computer nodes of a cluster. And the global results 
can be efficiently obtained by aggregating the local results. Our analysis sug- 
gests that the technique presented in this chapter can effectively improve the 
scalability and the efficiency of spectral feature selection algorithms. In most 
cases, the technique provides a near linear speedup, as the number of computer 
nodes used for computation increases. 

When we face the small sample problem, if the number of features is larger, 
we will not have enough information to reliably estimate the relevance of 
features. The small sample problem is a very challenging problem in feature 
selection [143, 93, 42, 41, 159, 125]. Using multiple knowledge sources available 
for feature selection provides a promising way of handling this problem. In 
the next chapter, we study how to use multiple knowledge sources in spectral 
feature selection to achieve multi-source feature selection. 
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Chapter 6 


Multi-Source Spectral Feature 
Selection 


One challenging problem in many feature selection applications is the small- 
sample problem [143, 93, 42, 41, 159, 125], where the dimensionality of data 
is extremely high, while the sample size is very small. For instance, a typical 
cDNA microarray data [88]) used in modern genetic analysis usually contains 
more than 30,000 features (the oligonucleotide probes), but the sample size is 
often less than 100. With so few samples, many irrelevant features can easily 
gain their statistical relevance due to randomness [159]. With a data set of this 
kind, most existing feature selection algorithms become unreliable by selecting 
many irrelevant features. For example, in cancer study based on cDNA mi- 
croarray, researchers found that traditional feature selection algorithms offer 
limited or inaccurate selection of biological features [118, 159]. Fold change! 
is a popular method used in gene selection. 

To study its actual performance when sample size is small, we obtain a 
microarray data set from Gene Expression Omnibus (GEO) [11] with the ref- 
erence id GSE2403. We randomly partition samples into positive and negative 
groups with 10 samples in each group. We then apply the fold change measure- 
ment on the split sample to identify significantly regulated genes. We repeat 
this process 10 times, and the number of significantly regulated genes identi- 
fied each time is shown in Figure 6.1. On average, we identify 12.7 significantly 
regulated genes on each random split. We also apply the t-test [123] on the 
original split,? and identify 16 significantly regulated genes, which is only a 
little bit larger than the average number obtained on the random splits. This 
example shows that when sample size is small (20), and the number of fea- 
tures is very large (11,362), many features can be identified as significant on 
an arbitrary split of the samples. This implies that on the original split, some 


lThe ratio between the positive sample mean, u+, and the negative sample mean, p_, 
of the expression of a gene. When uy < p_, the ratio is computed as: w4/u— and the 
gene is called up-regulated. Otherwise, it is computed as y_/u+ and the gene is called 
down-regulated. To improve reliability, we usually apply the t-test [123] to verify whether 
the positive and the negative sample means are statistically different. When the p-value 
computed from the t-test is small enough, e.g., <0.05, and the fold change is large enough, 
e.g., >2, we say the gene is significantly regulated. 

2The data are split into the positive and the negative sample sets using the class label 
information. 
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of the significant features identified by fold change may gain their statistical 
significance by sheer randomness. 


Number of Significant Genes According to Fold Change 


1 2 3 4 5 6 7 8 9 10 


FIGURE 6.1: Significantly regulated genes identified by fold change when 
samples are randomly split into positive and negative groups. The dash line 
stands for the number of significantly regulated genes identified using the 
original split. 


To improve the reliability of statistical analysis, we need to increase sample 
size. However, in many real applications, it is impossible for us to increase the 
size considerably, since the process of acquiring additional samples can be very 
costly. One way to address this problem is to include additional information 
sources to enhance our understanding of the data in hand. For instance, the 
recent developments in bioinformatics have made various knowledge sources 
available, including the KEGG pathway repository [87], the Gene Ontology 
database [25] and the NCI Gene-Cancer database [151], etc. Recent work has 
also revealed the existence of a class of small noncoding RNA (ribonucleic 
acid) species known as microRNAs, which are surprisingly informative for 
identifying cancerous tissues |118]. The availability of these various knowledge 
sources presents unprecedented opportunities to advance research by solving 
previously unsolvable problems. However, most feature selection algorithms 
are designed to handle learning tasks with a single data source, and cannot 
benefit from any additional knowledge sources. 

To address this limitation, in this chapter, we discuss novel multi-source 
feature selection algorithms for integrating different types of knowledge in 
the feature selection process. In different domains, the relationships among 
knowledge sources can be very different. To facilitate our study, we focus on 
genetic analysis based on microarray data. We show that using different types 
of knowledge such as gene annotation and biological network can improve 
the reliability of gene relevance estimation. In genetic analysis, different types 
of knowledge describe genes or samples from different perspectives and have 
heterogenous representations. One major challenge in multi-source feature se- 
lection is how to address the heterogeneity in different types of knowledge 
sources. In the following, we first categorize different types of knowledge that 
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can be used in multi-source feature selection, which helps us identify the com- 
mon characters of the knowledge from the same category, to facilitate us in 
handling the heterogeneity of knowledge representations. We then describe 
two multi-source feature selection algorithms. One is based on the combina- 
tion of similarity matrices, and the other on rank aggregation. Both algorithms 
can effectively integrate multiple knowledge sources with heterogeneous rep- 
resentations. 


6.1 Categorization of Different Types of Knowledge 


In order to handle different types of knowledge properly in multi-source 
feature selection, we first study how to categorize different types of knowl- 
edge for feature selection. To facilitate analysis, we fix our domain in genetic 
analysis, where genes correspond to features. 


O Nucleic acid 


| 
P14-ARF | metabolic 
Jl | Ñ 
Transcription © 
| 
MDM2 BCR-ABL | y 
| Transcription Regulation 
Al: A | DNA-dependent o — O of transcription 
| 
| 
P53 4 CHK1 l Transcription Y Regulation of 
| from RNA pol-il O _ transcription 
A l promoter Eg DNA-dependent 
| | 
| oY 1 y yy 
ATR i e) O O O o 
l ARNT RPB4 CTK1 CTK2 CTK3 
(a) | (b) 


p53: cctggagcacggaagattctctcctccagccgaggactacccgatcgtcgtigtgcgga ... 
(c) mdm2: gctttgttaacggggcctcccgtgagtctggacatctgcgctatgccactctggccgagcc ... 
ras: gaattccggtgtgtgggaccgtgggatccccattcagctgccagcgtctctictggcagca ... 


FIGURE 6.2: Knowledge of genes (features): (a) metabolic pathway, (b) gene 
ontology annotation, and (c) gene sequence. 


Categorically, there are two types of knowledge: knowledge about features, 
and knowledge about samples. The knowledge about features usually contains 
information about the properties of features and their relationships. Figure 6.2 
presents three different types of knowledge about features (genes) that can be 
used in genetic analysis: (a) metabolic pathway, which depicts a series of bio- 
chemical reactions occurring in cells and reflects how genes interact with one 
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another to accomplish a specific function; (b) gene ontology (GO) annota- 
tion [25], which uses a controlled vocabulary to describe the characteristics of 
genes; and (c) gene sequence, which describes the order of the nucleotide bases 
of genes. The figure shows that the three types of knowledge have heteroge- 
nous representations. The nature of the knowledge determines how it can be 
used in feature selection. According to the way knowledge is used in feature 
selection, we further divide different types of knowledge into three categories: 


1. Knowledge about feature similarity: KE EA 


For instance, with gene sequence information, gene similarities can be 
obtained by applying a sequence alignment algorithm. 


2. Knowledge of feature functions: KEEN: 


For instance, in a metabolic pathway, a set of genes act together to 
accomplish a biological function. The functions of genes are also provided 
in gene ontology annotation. 


3. Knowledge of feature interaction: KEZA. 


For instance, in the BioGRID [169], over 198K genetic interactions re- 
lated to different types of biological processes are recorded. 


The knowledge of features is accumulated and cross-examined by human 
researchers via multiple independent experiments. Therefore, it is relatively 
reliable, and independent of any specific experiment. 

The knowledge of samples is usually about sample categories, K24\, or 
their similarity, A, Samples can be categorized by either a flat structure, 
as shown in Figure 6.3(a), which forms the standard class label, or a hier- 
archical structure, as shown in Figure 6.3(b). The similarity among samples, 
depicted by the pairwise sample similarity matrix, can be derived from a given 
auxiliary data set. Auxiliary data refers to the data containing additional in- 
formation of the same set of samples in the target data. The target and the 
auxiliary data depict the same set of samples, while using different measure- 
ments. Auxiliary data may help us get a better understanding of the geometric 
pattern of the samples. For example, as shown in Figure 6.3(c), for gene se- 
lection, the microRNA microarray can serve as auxiliary data, which measure 
the microRNA expression of samples. cDNA microarray (messenger-RNA mi- 
croarray) and microRNA microarray are collected from the same set of sam- 
ples. Compared with cDNA microarray, microRNA microarray contains the 
expression of only several hundred microRNA and is found to be surprisingly 
informative in separating tissues of cancer from noncancer, as well as tissues 
of different types of cancers [78]. Using microRNA microarray as auxiliary 
data helps improve our understanding about how cancerous samples cluster 
together. Compared with knowledge about features, knowledge about samples 
is obtained from an individual experiment. Therefore, it is more specific. 

Table 6.1 summarizes different categories of knowledge that can be used 
for feature selection in genetic analysis. Some types of knowledge may fall 
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(a) Acute lymphoblastic leukemia (ALL) | (P) EU 

Chronic Lymphocytic Leukemia (CLL) E 

LL ML 

Acute Myelogenous Leukemia (AML) 

Chronic Myelogenous Leukemia (CML) ALL CLL AML CML 
(c) Gi Gp mG, mG, 

lı 
mRNA miroRNA 
Microarray Microarray 
(Target Data) (Auxiliary Data) 


In 


FIGURE 6.3: Knowledge of samples: (a) the class label information, (b) a 
sample hierarchy, and (c) an example of the auxiliary data. 


TABLE 6.1: Categories of different types of knowledge for gene selection. 


| Kaa - Category | Class Label, Sample Hierarchy 


S l 
GA GAN rta miRNA Expression Profile, 
Kfm - Similarity Sa 
Knowledge mRNA Expression Profile 
Gene Sequence, Gene Ontol- 
KEES - Similarity | ogy Annotation, Gene Lin- 
Feature eage, Gene Locus 


Gene Ontology Annota- 
KEES - Function | tion, Metabolic Pathway, 
Gene-Disease Association 
Metabolic Pathway, Protein- 
Protein Interaction 


| KE AA - Interaction 
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into more than one category. For instance, gene ontology annotation can be 
used for obtaining the knowledge of both gene similarities and gene functions: 
gene similarities can be obtained by comparing the annotation terms shared 
among genes; and knowledge of gene function can be identified by checking 
the annotation terms related to gene functions. Different types of knowledge 
describe features or samples from different perspectives. The categorization 
of different types of knowledge helps us identify their common characters and 
allows us to develop common approaches to analyze the knowledge from the 
same category. 


6.2 A Framework Based on Combining Similarity Ma- 
trices 


Given multiple knowledge sources carrying information about features and 
samples, we need to exploit them effectively in feature selection. The hetero- 
geneity of knowledge representations necessitates a common way to represent 
knowledge that meets the following requirements: (1) information can be easily 
extracted for both features and samples; (2) information can be combined for 
integration; and (3) information can be effectively used for feature selection. 
In this section, we use the similarity among samples as the common repre- 
sentation. We show: (1) given relationships among features, we can obtain 
the similarity among samples; (2) upon obtaining sample similarity matrices 
from various knowledge sources, we can combine them to form a global sample 
similarity matrix; and (3) using the obtained global sample similarity matrix 
we can select features using spectral feature selection algorithms. 

The high level idea of this multi-source feature selection approach is shown 
in Figure 6.4 as three steps: 

(1) Knowledge Conversion — knowledge understandable for human beings 
may not be directly applicable in a learning model. Therefore, the first step 
is to extract knowledge for learning. Assume we have L different knowledge 
sources K¡,...,K y. For the i-th knowledge source, we can apply a conversion 
operator c; (-) to extract a local specification of the sample similarity matrix, 
S;. This allows us to formalize the knowledge conversion step as 


(2) Knowledge Integration — Given multiple local sample similarity ma- 
trices, we can obtain a global similarity matrix by linearly combining local 
similarity matrices [228], 


L 
Sglobal = 5 aiS;, (6.2) 
i= 


where a, is the combination coefficient, which can be assigned by domain 
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FIGURE 6.4: The framework of multi-source spectral feature selection. 
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experts according to their domain knowledge [228]. If the label information is 
available, a; can also be learned automatically via convex optimization, which 
is related to kernel learning. The study of kernel learning goes beyond the 
scope of this book, and we refer readers to the literature for a comprehensive 
introduction [99, 206]. 

(3) Feature Selection — after the Syovar is obtained, it can be used in 
a spectral feature selection algorithm for feature selection. Since Sglobal is 
constructed using multiple knowledge sources, the feature selection conducted 
in this step forms a type of multi-source feature selection. 

Since Step 1 is source-dependent, we discuss next how to define the conver- 
sion operators c (-) to extract local sample similarity from different knowledge 
sources with heterogeneous representations. 


6.2.1 Knowledge Conversion 


The conversion from ae (the knowledge of sample category, e.g., class 


label) to KSAM (the knowledge of sample similarity, e.g., sample similarity 
matrix), Na > Sena , 1s straightforward. For example, given the class 
label y, we can use the following equation to obtain the sample similarity 
matrix S: 7 
a=] no Y=Y=l | 
K 0, otherwise 


As we show in Chapter 4, by applying in the spectral feature selection 
framework SPEC, we obtain the Fisher score feature selection algorithm. Be- 
low, we discuss how to perform conversions from other knowledge sources to 
the knowledge of sample similarity: 


FIS 
sE 


e KEEA, knowledge of feature similarity 
> KN , knowledge of sample similarity. 


. KREN, knowledge of feature function 


> KPM , knowledge of sample similarity. 


e KEEA, knowledge of feature interaction 


> KCEAM knowledge of sample similarity. 


Figure 6.5 illustrates how to convert K5A and KEES to KZAM. The 
idea is to use KE PA and KELA to influence the computation of the pairwise 
sample similarity. When Ke Ea is given, we can use it to compute the feature 
covariance, which then can be used in Mahalanobis distance [122] to derive 
the pairwised sample similarity. When KEZA is given, we can use it to filter 
the data, and then compute the pairwised sample similarity using the filtered 
data. Below we show how to convert KEET: KERA, and KEBA to KZAM in 


detail. 
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FIGURE 6.5: Converting KEEF and KEFA to KSAM. 


6.2.1.1 KEE! —> KSAM 


Given similarities among features, feature covariance can be constructed 
and used in calculating the pairwise sample similarity via Mahalanobis dis- 
tance [122] as 


Ix- yli = œ- y)" C (x-y), (6.3) 


where x,y € R” are two samples with m features F1,..., Fm, and C € R™*™ 
is the covariance matrix. In comparison with the standard Euclidean distance, 
Mahalanobis distance provides a better way to determine the similarities 
among samples by considering the probability distribution of the underly- 
ing model. The ellipsoid best representing the probability distribution can be 
estimated from C [72]. In real-world applications, C is usually estimated using 
the equation 


C= dolar (x-3), (6.4) 
k=1 
where X1,...,Xn are the n samples of the data, and X is their mean. Let 


I— 211' be the projection matrix [59] that centralizes the data to have zero 
mean. It has the properties 


(x1 =%....%1-) =x (15107), (6.5) 
(1- EL ` (1- 17) (6.6) 
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(1- 117) (1- 117) = (1- “1 i; (6.7) 


Hence, we can also write Equation (6.4) in the form 


Gee! x (1- 2117) x'. (6.8) 
n—1 n 

Although Equation (6.4) specifies an unbiased estimator of the covariance 
matrix, when sample size is small, its estimation can be poor [97]. In addition, 
the covariance matrix can also be obtained from our knowledge of feature simi- 
larities, providing another (maybe more stable and reliable) way for estimating 
C. For instance, in genetic analysis, the similarities among genes (features) are 
usually specified by: graphs (or kernels), e.g., biological pathway and protein- 
protein interaction; or they can be derived from gene descriptions, e.g., gene 
annotation [25]. The similarities among features are usually described by a 
similarity matrix of features, Sprega. And Sr pa (1,5) presents the similarity be- 
tween features f; and fj. Given Spa, we can calculate an embedding [14] for 
the features, which can be then used in Equation (6.8) to construct C. Below 
we study how to compute C from Sp ga in detail. 

Given a feature similarity matrix Sp pga, we first construct a commute time 
embedding [188]. The commute time embedding preserves commute distance 
that measures the expected time that takes for a random walk to travel from 
one vertex to another and back [117]. Tt has been shown effective in preserving 
similarities in the embedding space. Let L = D — S and L = USU! be the 


SVD of L, the embedding of the features is given by Xrga = (£+)? UT, 
where each column of X pg 4 corresponds to a feature vector f;. By transposing 
1 


Xprpga, we obtain the explicit expression of features: XW = U(2*)?. By 
substituting XW; in Equation (6.8), we can obtain the covariance matrix C. 
We use the following proposition to summarize how to compute C from Sp ga. 


Proposition 1 Given a feature similarity matrix S depicting similarity 
among features, using commute time embedding, the covariance matrix C is 
given by 


C=U (> = I (8+)? 117 (5) UT, L=D-S, L=UXU”. (6.9) 


6.2.1.2 KEES, KEEA > KSAM 


In real-world applications, some particular feature functions or certain 
types of feature interactions may be of interests according to a specific research 
purpose. For example, in gene selection for cancer study, genes with certain 
types of functions or genes participating in certain types of biological processes 
(e.g., genetic interactions) are of special interest to biologists. 


Given KEEA or KFEA and F, a set of feature functions, or I, a set of 
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feature interactions, data can be filtered by the features associated with F, or 
I: 
Xp = Ile, (X), Xi = Up, (X). (6.10) 


Here Fr and Fy are the features related to F and I, respectively, and II (-) is the 
projection operator. Using the filtered data Xp or Xr, we can obtain a pairwise 
sample similarity matrix S through any similarity measure. Since all features 
in Fp (or Fy) are related to the feature functions (or feature interactions) 
of interest, sample similarity matrix S should reflect the distribution under 
the influence of the functions (or the interactions). In case that the functions 
(or the interactions) are closely related to the target concept under study, 
the distribution will give us an insight of the target concept, and help us to 
select relevant features. Using features that are known to have a particular 
function or participate in a particular interaction as the seeds (or a initial set 
of selected features), we can select features that perform the same function or 
participate in the same interaction. 


6.2.2 MSFS: The Framework 


The above discussion paves the way to a framework for Multi-Source 
Feature Selection based on spectral feature selection: MSFS. The detail of 
the framework can be found in Algorithm 1, which consists of three major 
steps: (1) obtaining local sample similarity from each data source (Lines 1-3); 
(2) combining local sample similarity to construct a global sample similar- 
ity (Line 4); and (3) using the global sample similarity in a spectral feature 
selection algorithm to select features (Line 5). 


Algorithm 9: MSFS: Multi-Source Feature Selection 
Input: X,...Kz and X 

Output: Listp - the selected feature list 

forall the K; € (Kı... Kz) do 

2 Ñ construct S;, the local sample similarity; 


= 


3 obtain global sample similarity S from S1,..., SZ; 
4 feed S in to SPEC to select feature and form List p; 
5 return Listp; 


6.3 A Framework Based on Rank Aggregation 


One limitation of MSFS is that it replies on combining sample similarity, 
which restricts its flexibility in handling small-sample data. To address this 
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limitation we propose in [226] a general approach to systematically integrate 
different types of knowledge for Knowledge-Oriented multi-source feature se- 
lection, named KOFS. Figure 6.6 presents the major steps in the approach. 


IKnowledge Conversion L Il 
| external internal | 
knowledge knowledge l 

| 


a convert 7 
| 


l Feature Ranking K 


internal knowledge features ranking list 


ranking 


| Ranking Aggregation 


ranking lists final ranking list 


aggregation 


FIGURE 6.6: KOFS, a framework for knowledge-oriented multi-source fea- 
ture selection. 


Step 1: Knowledge conversion — Knowledge understandable for human 
beings may not be directly applicable in a learning model. Therefore, the first 
step is to convert different types of human or external knowledge to certain 
types of internal knowledge that can be used by gene selection algorithms. 
Assume we have L different external knowledge sources K{*,...,K¢"". For 
the ith external knowledge source, we can apply a conversion operator c; (-) to 
convert the external knowledge K¢*" to the corresponding internal knowledge 
Kint, and this allows us to formalize knowledge conversion with the following 
equation: 

kresek ai ele (6.11) 


Step 2: Feature ranking — Assume K sets of internal knowledge 
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KNOW,,..., KNOW ar is used to rank features, where KNOW, is defined as 
KNOW, = Km... kih, Let C; be a relevance criterion, F = {F}, ..., Fm} 


ii 
be a set of M features, and R; (-) be a feature ranking function. The task of 
feature ranking is to use the internal knowledge with the given criterion to 


rank the relevance of the features in F. This can be formulated as 
RE — R(KNOW,,C;,F). (6.12) 


Step 3: Rank aggregation — After obtained the K ranking lists, they 
need to be integrated to generate the final ranking to estimate the relevance 
of features. Let A (-) be an aggregating operator for ranking lists and C be an 
aggregation criterion, we use A (-) to aggregate the K ranking lists. This can 
be formulated as 

Fee E A estore E) (6.13) 


The final feature ranking list can be obtained by considering the ranking 
lists from all internal knowledge sources in either a supervised or an unsuper- 
vised fashion, depending upon how C is specified. 

KOFS and MSFS are different in the following ways: 


1. KOFS explicitly defines the concepts of external and internal knowledge, 
and organizes different types of knowledge into well-defined categories, 
while this is not the case for MSFS. 


2. In KOFS, the coefficient combination can be automatically learned, 
while this problem is not addressed in MSFS. 


3. KOFS is based on combining ranking lists, while MSFS relies on com- 
bining sample similarity, which restricts the model flexibility. 


4. MSFS forms a special case of KOFS, if we specify sample similarity, 
KAM „as the only representation for internal knowledge in KOFS, and 
converge all other representations to it. 


6.3.1 Handling Knowledge in KOFS 


Different types of external knowledge and internal knowledge need to be 
handled properly in KOFS. In Section 6.1, we studied how to categorize differ- 
ent types of external knowledge sources. Below we define different types of the 
internal knowledge that can be used with KOFS. We also show how to convert 
different types of external knowledge to corresponding internal knowledge. 


6.3.1.1 Internal Knowledge 


While defining internal knowledge, the following two issues should be con- 
sidered. First, the definition should ensure that certain types of external 
knowledge can be easily converted to its form. Second, it can be effectively 
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used to rank features. Based on these two considerations, in KOFS, we use 
the following types of knowledge: 


e Knowledge about samples: 


int, SAM 
— sample category, Ki 


— sample geometric pattern, ees 


e Knowledge about features: 


: int, FEA 
— feature relation, CREL 


— feature function, KOREA 

Here feature relation can refer to either the similarity among features or the 
interaction among features, since both types of knowledge provides us with the 
information about the relationship among the features. Later on, we will show 
how to propagate feature relevance on the network derived from ce 
KOFS is not restricted to the four types of internal knowledge defined above. 
As long as new knowledge can be used to rank features, it can be treated 
as a type of internal knowledge. This ensures the extendability of KOFS. In 
real-world applications, we found that most available external knowledge in 
feature selection for genetic analysis can be conveniently converted to one of 
the four types of internal knowledge. Next we show how to convert various 
types of external knowledge to internal knowledge. 


6.3.1.2 Knowledge Conversion 


Table 6.2 contains the information of how different types of external knowl- 
edge can be mapped to their corresponding internal knowledge. 


TABLE 6.2: The conversion of different types of external knowledge to in- 
ternal knowledge. 


xternal Knowledge nternal Knowledge 
Ext lk ledg Int lk ledg 
ext,SAM ext,FEA ext,FEA ext,FEA int, SAM 
Ksrm Keun > Esta > KINT Ksrm 
ext, FEA ext,FEA int,FEA 
Ksru > KINT KREL 
ext, FEA int,FEA 
run Krun 
ext, SAM int,SAM 
Kear Koar 
a ext, SAM int, SAM ext, SAM int, SAM 
că tatii e) e e e 
ext, int, ext, int, ext, int, 
Kram KaeL > KINT ‘ KREL „and Kun Krun are 
straightforward. For example, Kfir , the similarity among features, and 
Ro, the interaction among features, can be directly used to construct 


feature relation graphs, corresponding to KATR Also, we have studied 
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: ext,FEA int, SAM ext,FEA int,SAM ; 
the conversion of Korm > Ksry and Krun > Ksry in Sec- 


tion 6.2.1. Below, we show how to use different types of internal knowledge to 
rank features. 


6.3.2 Ranking Using Internal Knowledge 


Now that we have the various types of internal knowledge ready, we can 
study how to use them to rank features as well as how to combine various 
ranking lists to obtain a final list. 

The internal knowledge can be used to rank features in various ways. 


Selecting features using ae corresponding to traditional supervised 


feature selection algorithms, has been well studied. Given KS, features 
can also be ranked via spectral feature selection algorithms. Below we show 
how to rank features using the other two types of internal knowledge. 
6.3.2.1 Relevance Propagation with KERE EA 

Given KRE, the knowledge of feature relation, we can derive a graph 
G to depict the knowledge. Given a set of features F = {F;,..., F}, which 
are known to be relevant, we can propagate their relevance on the graph to 
nearby nodes. Assuming co is built from ee ee the knowledge of 
feature similarity, relevance propagation corresponds to the hypothesis that if 
a feature is relevant, the features, which are similar to it, may also be relevant. 
We can formulate the idea using the concept from random walk theory [117]. 
Assume Sp is the similarity matrix corresponding to G, which is derived from 


Coo The transition probability matrix is defined as 
Pr = D;'Sp, 
Dr = diag (ai dias 


dF = > Sr (i, k). 


Assuming r is the vector containing the initial relevance of features, the final 
relevance of features is given by 


r* = r+... + (APP r+... + (APF) r 
= (I-APp)7r. (6.14) 
In the above equation, (APr)* r corresponds to the relevance gained by genes 
after k steps of propagation, and 0 < A < 1 is the decay parameter, which is 
used to reduce the magnitude of the relevance when it is propagated from one 


node to another node. After obtained r*, features can be ranked according to 
their corresponding value in r*. 


: z int, FEA 
6.3.2.2 Relevance Voting with Kyn 


The functions of features are usually depicted by a controlled vocabulary. 
In this case, the terms can be regarded as the hyper features of the original 
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features. Let F; be the ¿-th feature in the feature list, its function can be 
obtained from co which is described by a vector v; = (vi1,.-.,Ui,7)- 
Here T is the total number of functions. And v; j = 1, if and only if the feature i 
is related to the function j, otherwise v; ¿ = 0. Assuming we know the relevance 
of all the functions, which is described by a vector rf" = (a suis ins the 


relevance of F; can be obtained by the following equation: 


T 
r= 5 vi, yx rin, (6.15) 
i=1 


The equation sums the relevance of all the functions related to the feature 
as its relevance score. rfyy can either be assigned by researchers according 
to their research purpose or learned automatically. In Section 6.4, we show 
an example on how to learn the relevance of feature functions from the data 
automatically. 


6.3.3 Aggregating Feature Ranking Lists 


Using different types of knowledge, we can obtain multiple lists that rank 
features in different ways. Aggregating these rankings has been studied as rank 
aggregation in both machine learning and information retrieval [149]. In this 
section, we propose a probabilistic model for rank aggregation. While existing 
rank aggregation algorithms treat different ranking lists equally in the com- 
bination process, e.g., the methods presented in [149], the proposed method 
is able to automatically learn a set of combination coefficients according to 
the importance of different ranking lists. This is achieved by maximizing the 
relevance likelihood of the features in a given feature set. When the set only 
contains features that are known to be relevant, the model achieves rank ag- 
gregation in a supervised way. When the set contains all features, it combines 
ranking lists in an unsupervised way. 

Let F; denote the i-th feature, 1 < i < M, and its rank in ranking list l 
be r¡;. We define the probability of F; to be relevant according to its rank in 
the ranking list l as 


1 1 M 1 
P(rii) = Bex (=) „B = Da, exp (+). 


In the equation, B is the normalization factor for the distribution. For defining 
the probability, the exponential function exp(-) is adopted to emphasize the 
top ranked features. Given L ranking lists R¡,..., Ry, let the prior probability 
of picking the l-th ranking list, R;, to rank features as m; with mı +...+7z = 1. 
m reflects the reliability of Ry. To construct a mixture model [15], for each 
feature F;, we introduce an L-dimensional latent variable Zz; = {2:,1,..., Zi L} 
indicating which ranking list is used to rank F;. That is, if F;’s rank is taken 
from its rank in Ri, then z;; = 1, and all other elements in z; are set to 
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0. Based on these definitions, we can formulate the joint likelihood of the 


relevance of a feature set G = (71,..., Fx} as 
K OL 
p(Fi,...,Fx,Z|Ri,--- Rr, 9) = [[ [P ria)". (6.16) 
i=1 l=1 
In Equation (6.16), Z is the set of latent variables, Z = (zii)KxL 
=(Z1,..., ZK). T = {11,..., TL} can be obtained by maximizing the joint 


likelihood specified in Equation (6.16) with an EM algorithm. 


6.3.3.1 An EM Algorithm for Computing 7 


Expectation maximization (EM) is a standard iterative approach for 
finding the maximum likelihood estimates of parameters in a probabilistic 
model [15]. The probabilistic model specified in Equation 6.16 can be solved 
by the EM approach in the following way: 

E Step. Assuming that m is known, the posterior distribution of Z takes 
the form 


P (Z|R1,--- ,Rr,G)a P (Z) P(G|ky,:--: , KL, Z) 
TT 0) ( ay i 
zi TTL (+ [mio (o; ) 
¿=11=1 


{mP lr). 


II 


a. Ze 
Ti zu 


o 
Il 
= 
~ 
Il 
= 


Using standard techniques, we can show that y; 1, the responsibility of Lı for 
F;, is given by 


TP (rii) 
TP (ria) 


Vt = E (zia) = (6.17) 


ce 


1 


J 


Ji ı can be used to determine the expectation of the complete log likelihood, 
which defines the Q function [15] as 


Q (0,0%) = E, (In P (G, Z/0)) 
K L 
= yii Unr + n P (1r1,)}. 
i=1 l=1 
M Step. Assuming that Z is known, we can find the O by maximizing 
the Q function under the constraint of 7; +...+az, = 1. And this leads to 
the equation for updating 7; as 


mY = EE (6.18) 
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The algorithm is guaranteed to converge as shown in [15]. After obtained 
Tr, the probability of F; to be relevant can be calculated by marginalizing the 
joint probability P (F;, Ri): 


L L 
P(A) = Ņ_P(F, Ri) =D P (AIR) P(Ri) 


L L 
= XO P (ri) P(Ri) = SP (ria) m. (6.19) 


The final feature ranking list can be obtained by ranking the obtained rele- 
vance likelihood of features. 


6.4 Experimental Results 


We empirically evaluate the effect of multi-source feature selection in the 
domain of genetic analysis. As we mentioned in Section 6.3, if we specify 
sample similarity, ea , as the only representation for internal knowledge in 
KOFS, MSFS actually forms a special case of KOFS. Therefore, the KOFS 
framework forms a more general case than the MSFS framework. Hence, in 
this section we focus on evaluating the performance of the KOFS framework 
for multi-source feature selection. 


6.4.1 Data and Knowledge Sources 
6.4.1.1 Pediatric ALL Data 


The data is obtained from the Gene Expression Omnibus (GEO).? The 
data contain the expression profiling of 4,670 genes in bone marrow from 18 
pediatric patients with acute lymphoblastic leukemia (ALL): 10 B-cell ALL, 
5 T-cell ALL, and 3 B-cell ALL with the MLL/AF4 chromosomal rearrange- 
ment. Each bone marrow is measured twice, resulting in 36 samples. The 
data provide insight into the pathogenesis of childhood acute lymphoblastic 
leukemia. 


6.4.1.2 Knowledge Sources 


Five different knowledge sources are used in the experiments: 


1. Sample category: 


Patients are assigned to one of the three classes, B-ALL, T-ALL, or ML- 


L/AF4. The sample category information forms one type of as 


3http://www.ncbi.nlm.nih.gov/geo. Access ID: GSE2604. 
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2. Gene expression: 


The expression profiles of genes are used to obtain sample pairwise sim- 


ilarity with Mahalanobis distance, forming one type of Ka 


3. Metabolic pathway: 


The 208 Homo sapiens metabolic pathways are obtained from the KEGG 
pathway repository [87]. Six ALL-related pathways, including B-CELL 
RECEPTOR pathway and T-CELL RECEPTOR pathway are selected by the 
biologist. These pathways form one type of the ka (gene function), 
and the genes involved in these pathways are used to filter data for 


: int, SAM 
calculating Kis 


4. Cancer-gene annotation: 


The cancer gene annotation data are obtained from three knowledge 
sources: IPA gene annotation,* NCI Gene-Cancer database [151], and 
Cancer Gene Census project.? The cancer gene annotation data form 


one type of ea Ace which is used to construct both KENOM and 


int, FEA 
KFUN 


5. Gene ontology (GO): 


We obtain the GO annotations for genes from the Gene Ontology 


Database [25]. The information forms one type of ce and one 


type of Koes (gene similarity). Kae is extracted from GO an- 
notation using an information content based measure proposed in [135]. 


The obtained ees is used to construct eee AM with Mahalanobis 


distance and eee for relevance propagation. 


6.4.2 Experiment Setup 


By using different types of knowledge, and their combinations we can ob- 
tain eight ranking lists. Detailed information of how these lists are obtained 
can be found in Tables 6.3 and 6.4. Among the eight lists, SPEC (y2) and 
Fisher score correspond to using the traditional unsupervised and supervised 
feature selection algorithms on microarray data to select genes, respectively. 
The other six ranking lists correspond to using one or two types of external 
knowledge to select genes. The eight lists are used as baselines in the experi- 
ment for comparison. For GO-REL-VOTE and GO-CAN-MAH, the relevance 
of a GO term is determined by Mean/Mau, where Mau denotes the number 
of the genes associated with the term and Mean denotes the number of the 
cancer-related genes associated to the term. 


*http://www.ingenuity.com/. 
Shttp://www-sanger.ac.uk/genetics/CGP/Census/. 
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The eight ranking lists are aggregated in three ways: 


1. KOFSporaa: 
Rank aggregation with Borda count [46]. 


2. KOFSprob: 


Rank aggregation using the probabilistic model proposed in Section 6.3.3 
with all genes. 


3. KOFSprob-SUP! 


Rank aggregation using the probabilistic model proposed in Section 6.3.3 
with only the acute lymphoblastic leukemia (ALL)-related genes. 


The Borda count [46] is a representative rank aggregation algorithm based 
on majority voting. It is used as a baseline in the experiment for comparison. 


6.4.3 Performance Evaluation 


To evaluate the performance of different methods, we use four evaluation 
criteria: (1) Accuracy: accuracy of INN achieved on the top ranked genes 
provided by different algorithms; (2) SiManno: the similarity between selected 
genes and the known ALL-related genes according to GO annotation; (3) 
Hitcanc; and (4) HIT eu. The last two are the counts of known cancer-related 
genes and ALL-related genes in the top ranked genes provided by methods. 

Among the four, Accuracy is the standard criterion for evaluating the 
statistical relevance of the selected genes. For genes that are related to the bio- 
logical process inducing different phenotypes, their expression patterns should 
be different on samples of different phenotypes. Therefore, using these genes in 
classification or clustering should result in high accuracy. However, due to the 
small sample problem in cDNA microarray analysis, genes that result in high 
accuracy may not be biologically relevant. The next three criteria, SiManno, 
Hitcanc, and HITieu, are designed to provide evidence of how many selected 
genes are biologically relevant according to literature. They form “evidence 
criteria” for performance evaluation. The hypothesis is that if a gene list re- 
sults in high accuracy and contains many genes that are biologically relevant 
according to literature, it indicates that (1) the corresponding algorithm can 
select biologically relevant genes; and (2) other genes in the list may also 
be biologically relevant. Achieving high values on the three evidence criteria 
with low accuracy indicates that genes do not show a discriminative expres- 
sion pattern with different phenotypes. Therefore, it requires high value on 
both accuracy and evidence criteria to confirm the biological relevance of a 
gene list. In the following, we compare ranking lists obtained using the tra- 
ditional gene selection algorithms, using one or two types of knowledge, and 
using multiple types of knowledge. 
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6.4.4 Empirical Findings 


Table 6.5 contains the results obtained from methods using different types 
of knowledge. 

First, in terms of accuracy, the gene lists obtained from the Fisher Score, 
KOFSBorda, KOFSprop, and KOFSp;oph_sup achieve good performance. High 
accuracy indicates that the genes in these lists are statistically relevant, since 
they can separate samples from different phenotypes. We also notice that 
compared with SPEC, GO-MAH achieved higher accuracy. Both SPEC and 
GO-MAH use Mahalanobis distance, but GO-MAH uses the gene covariance 
learned from GO-based gene similarity. This suggests that the strategy pro- 
posed in Section 6.2.1 is effective. 

Second, in terms of the three evidence criteria (Simanno, HIT cane, and 
HITieu), the two methods using K/h 22 (GO-REL-VOTE and GO-REL- 
PROP), and the two methods of KOFS (KOFSp,.4 and KOFSprob-sup) 
achieve good performance, while the Fisher score and the other ranking meth- 
ods do not perform well. This is reasonable, since in Simanno, HIT canc, and 
HIT. we actually use | to evaluate genes. As GO-REL-VOTE and 
GO-REL-PROP are provided with K///2%, it is understandable that they 
can achieve better performance. We notice that by using only the terms re- 
lated to cancer for learning gene similarity, GO-CAN-MAH achieves a bet- 
ter performance than GO-MAH according to the three evidence criteria. For 
the methods of KOFS, the two methods using the probabilistic model pro- 
posed in Section 6.3.3 achieve good performance. Compared with KOF'Sprob, 
KOFSp,ob—sup achieves better performance with the evidence criteria. This 
clearly suggests that the label information used in KOFSprob-sup helps. Both 
KOFSpro» and GO-REL-PROP generate gene lists that have strong support 
from evidence criteria. However, in terms of accuracy, GO-REL-PROP’s per- 
formance is about 20% lower than that of KOFSprop. To intuitively observe 
the expression pattern of genes in each list, we apply cluster analysis on the 
genes selected by the two algorithms. The obtained heatmaps are presented 
in Figure 6.7. Results show that although many genes selected by the GO- 
REL-PROP are reported to be leukemia related in other studies, most of 
these genes do not show discriminative expression patterns on the current 
data. When clustering data using these genes, samples of different phenotypes 
are mixed up. The fact suggests that these genes may not be related to the 
current study. As compared with GO-REL-PROP, we observe that the genes 
selected by KOFSprop show discriminative expression patterns and lead to 
good clustering performance. 

Last, considering both accuracy and evidence criteria, the experiment re- 
sults in Table 6.5 show that the traditional algorithms and the algorithms 
using just one or two types of knowledge can only achieve either high statisti- 
cal relevance, or strong support for evidence criteria, but not both. Compared 
with these algorithms, the algorithms derived from KOFS can achieve a high 
performance with both types of criteria. The results clearly demonstrate the 


165 


Multi-Source Spectral Feature Selection 


“EJI BDUOPIAO WO sytoddns SUOL)S 

pue ogm USI y30q SƏIMDƏI JS 9us3 e JO SDUPAS]OL [ROISOTOIG 9Y} ULIȚUOD O], 'Aoatioodsa. ‘soues PoyeŢoa erua] 

pue 199489 UMOUX JO SOTIEI YY oy) ore MILIH pue PHH “uoryezoUue QN 0} 3UIP1ODI8 Soules pII TIY mouy pue sous 
poyoa]es UdoMJoq ANIRŢIUIIS TeUOTJOUN oy} st °““PUITG "SUYOS 94} Aq popraord soues CG pue ‘og ‘QT do} aqy 3ursn soues 
Aq posone Á9emose paseVIOAR oT} SI HAV-DOV ‘Apeatjoodso.s “SUIUILIOSȚE yuereyip Aq poptaocid soues Qg pue ‘og ‘QT doy ayy 


uo poaorqoe Lomen aq} 0} puodsa1102 YG-QOV PUL ‘OE-OOV ‘OT-OOV ‘s0ueULIOJI0d poo’ ayeorpur sioquinu prog :270N 


PAI GZ 9924 Z6'0 16'0 16'0 76:0 dns ads JOM 
ZI TZ t969 6'0 76:0 v6'0 46'0 ALAS TOM 
[A 9 €ZLT 260 26'0 6'0 16'0 “PIS TOM 
I v 189 090 790 790 og'o ITL 
ST dd 8892 820 980 820 020 dOUd-THY-OD 
I ei 966% 110 980 €8'0 790 HVIN-NVO-OD 
0 € 622 820 980 080 690 HVIN-OD 
8 97 9892 020 £8'0 69'0 990 ALOA-TAYU-0D 
0 v 108 110 680 180 190 ITIA-4emyyeg 
rá 8 EZS 26°0 26°0 26°0 26°0 9.1098 19YSIH 
0 [A 162, 120 €3'0 99'0 v9'0 Odds 
"LIH on’? LIH ASES SY DIV 0S-O0V DE-DOV OI-OOV SUOHLIAJA ONDINVY 


‘SpoYyJoul JUoIoyIp Aq poyestouos syst] SUIȚULI UPS 107 UOSTIedUIOD sdUeULIO}I0g :¢°9 ATAV.L 


166 Spectral Feature Selection for Data Mining 


FIGURE 6.7: (SEE COLOR INSERT) Cluster analysis on the genes 
selected by KOFSprop (left) and GO-REL-PROP (right), respectively. The 
color lines on the bottom of the figure correspond to the samples from patients 
of B-cell ALL (blue), T-cell ALL (red), and B-cell ALL with the MLL/AF4 
chromosomal rearrangement (green), respectively. 


efficacy of the proposed integrative approach for identifying biologically rele- 
vant genes. 


6.4.5 Discussion of Biological Relevance 


In order to more closely examine the biological relevance of the selected 
genes, we perform some further study, in which our biologist collaborators 
examine the top 50 genes selected by KOFSprob-sup. The information of 
relevant genes is summarized in Table 6.6. The upper part of the table con- 
tains the genes whose relevance to leukemia has been confirmed by the lit- 
erature. The lower part of the table contains the genes, whose relevance is 
unknown but cannot be ruled out. Analysis of these genes may yield the dis- 
covery of new leukemia-related genes. 17 leukemia-relevant genes are selected 
by KOFSp;op—sup. This list involves several crucial genes, such as USP33, 
LMOI, TIMPI, TIMP2 and STAT5B, which play important roles in the 
leukemia-related tumorigenesis and may lead to different subtypes of acute 
lymphoblastic leukemia (ALL). For instance, USP33 is reported to be con- 
sistently over-expressed in B-ALL samples but not in T-ALL samples [138]. 
LMO1 is mapped to an area of consistent chromosomal translocation in chro- 
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mosome 11, disrupting it in T-cell ALL. The LMO1 gene family is also defined 
as a class of T-cell oncogenes [181]. TIMP1 and TIMP2, members of Tissue 
Inhibitor of Metallo-Proteinases, are found to be related to the infiltration 
of ALL leukemia cells into extramedullary organs [175]. STAT5B is a mem- 
ber of the Signal Transducers and Activator of Transcription (STAT). The 
dysregulation of the signaling pathways mediated by this protein may be the 
cause of the ALL and other human cancers [209]. Twelve genes are found to be 
possibly leukemia or cancer related due to the following reasons: (1) their func- 
tions are related to tumorigensis and cell cycle control (e.g., PPARA, TIMP4, 
and CDK4); (2) they have cAMP-dependence (PRKACA and PRKARIA); 
(3) they are transcription factors (BRD8 and NCOR1), whose expressions 
are closely related to other known ALL genes mentioned above; (4) they are 
known to have high expression in leukemia (e.g., SIVA). Recent research re- 
sults reveal a role of SIVA inactivation in leukemia-related tumorigenesis, pre- 
sumably through enhancing NF-kappaB-mediated anti-apoptotic activity [64]. 
The study of these genes may help identify new biomarkers that are crucial 
to leukemia tumorigenesis. 


6.5 Discussions 


In gene selection research, various types of knowledge can be used to as- 
sist gene selection. For instance, the authors in [4] propose using different 
types of knowledge about genes to calculate gene similarity, which is then 
used to identify genes that are closest to the given example genes. In [172] 
the authors focus on using gene sets, which are groups of genes that share 
common biological functions, chromosomal locations, or regulations to inter- 
pret the gene selection outputs. In [136], gene annotation is used for choosing 
the optimal gene ranking criterion. In [7], protein interaction, gene-disease as- 
sociation, and gene function annotation are used for choosing cancer-related 
genes. Gene selection approaches using gene regulatory network and gene on- 
tology are also studied in [104] and [140, 168], respectively. Since most existing 
work is designed for specific research purposes, they can only handle one or 
limited types of knowledge of the same category. For instance, the models 
proposed in [172, 4, 7] can only handle knowledge about genes (features), but 
not knowledge about samples. To address this limitation, we present an inte- 
grative approach in this chapter to systematically incorporate different types 
of knowledge in feature selection. 

In this chapter, we investigate a novel problem arising from the need to se- 
lect features on one data source given multiple additional information sources. 
We extend the proposed spectral feature selection frameworks to achieve 
multi-source feature selection based on similarity combination (MSFS), which 
is further extended to achieve knowledge-oriented multi-source feature se- 


Spectral Feature Selection for Data Mining 


168 


199U89 I9PPe[d '199ue9 [e)9910Ţ09 Lyde 103da9991 payearyoe-10yeojyroid aurostxo1ed Vuvdd oS 


‘“QUIOXÁUI “19DUPI [LI1IIODOUIIPR eyd]e “1 adA3 “£x10yemn391 'quspuadap-J WY? “oseury urogord VIYU VAYA 67 
Z poyermosse sisaus3oyeuiads ZV.LVdS ev 
2 uredreo LNdVO ot 
199U89 dryeoued 8 Sururejuoo UreuIOpouIoIq squad se 
103984 Suronpur-sisoydode ‘T VAIS IVAIS 9€ 
199U9 yseoiq “195ueo 98ISOIA Į 10ss91d91-09 103d9991 1eƏPNU THOON pe 
aroun, Areymud eyd]e “orApeyeo “quapuadap-4 WY? “eseury ugod VOVMUdA ZE 
g quou “y apejo “1oyrqryur asepridad urdios ZANTAHAS IE 
LCUOL3 ‘euLOURTOUL “EUIOISEȚQOUIOI p oseury yuepuadap-urpo ko TACO GG 
eulooIesomqy Z OSBUIA OUISOIAY TAAL EG 
PULOIA ‘IULI 3ses1q p 1oŅqryur esepridado]re39ur JAWIL TANIL 8 
(ZL) seu pajerey-eruexnsT [120d 
9101 Y + “19DULI JOPperq “eruroxno z soyrqryur esepryydedoyeyout_ ANIL CANIL 37 
"er UIO no ep ‘(daa/O) 101d Surpurq 109ueyu2/ IV VOD dddao ¿y 
DI0UI y + “199U89 35e91q “ertroxno Į OSeUTĂ SUISOLÁ] P9ye]9--Ssuj ILIA 6 
ƏIOU y + ‘IULI 9VeIsoId “eruroyno I Sojouroy sus3o9uo [era VUIOWAY} JUNU INBA LLMV LI 
eiow J] + ‘ULI Jepperq ‘erureyne T 10y1qrqur eseprydedojyeyour GILL Id NLL 9 
erun Y2 oseroysorpoydsoyd Wadd g 
910U1 Z + ‘IULI 3un] “eruoxno 7 ISPUIA outsoiAy uleyoid ZM LA GALd €I 
ƏIOU Z + ‘IODULI 4sevoiq “eruoxno (1109 *06d) 109d9991 ULILI9ISUeI) DUAL I 
SPUIOUIDIED UPIIEAO Erun DATVEJOȚIȚOId-Iue ‘T sues UOIeDOISULI) [[99- OLE 0 
sinounj Iəppejq “eruexno (edAy ysruur g “sisopro[Aure) urosos NSD 6 
eruoxno gg asepiydad oyroods urymbiqn eedsn 8 
9IOU1 y + “199UBI9 ISe91q “ervroxno ous3oouo unf NOC PA 
DI0UI y + ‘IULI ISVOIG “EIUIONA g uronoad Surpurq 10398] MOI 9MI[-UINSUT EddAol 9 
910U1 Z + “199U9 ISe91q “eruroyno qg uondriiosues jo I0yeArmoe pue JooNpsuesy [eUSIS dG.LV.Ls G 
Lerun uteyo1d Surpurq əseury IUISOLA] utgoid OHAL ddOUAL p 
9IO0U1 g + ‘IULI yseoiq “ervioxno € “03 pa3e90[sue1y ‘g yrunqns © ‘ureWOp JUNI 10398] SUTPUTq-9109 &ELEVAGO G 
(ENE) (1 _urjoquuoya) q ATU ureurop WIT IOWT I 
(AT) Pa3e]ay eruexna7 og 07 UMOUY souon 
sIsdue) PIPPPH | SUBEN DUDE) | joquiAg user | yuey 


“TTY otiyerpad ur S9JOI 10 SUOTJOUN] [TVOISOTOIG IY} 09 SUIPIOIDe MO PAMI oq JOUULI IDUPAD]OL 
[e91I30]01q 9SOYM sous gı SUTeJUOD red IOMOT YT, OMPI] 09 SUIPIOO’ POYEŢRI PIUIOSNA IQ 09 UMOTY 918 YOM souss JT 
sureyuoo red roddn aq y *dAS~I"dgyQy Aq peptaoid ysy ous ye doy əy} UI səuə3 JURADTOI ATTROTSO[OI SUL :9°9 ATAVIL 


Multi-Source Spectral Feature Selection 169 


lection (KOFS) through rank aggregation. We design and conduct exten- 
sive experiments to objectively and systematically evaluate the KOFS frame- 
work, in comparison with existing representative single-source feature selection 
methods. The affirmative results demonstrate that using multiple knowledge 
sources can help improve feature selection of the target data. As multi-source 
data become more common, feature selection using multi-source data will be 
in high demand in many real applications. 
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